USERS table statistical analysis¶

Table of Contents

  • 1  Introduction
  • 2  Main Users table analysis
    • 2.1  Numerical data
      • 2.1.1  Summary statistics
      • 2.1.2  Normal distribution evaluation
      • 2.1.3  Distribution detection
      • 2.1.4  Outliers detection
    • 2.2  Categorical and boolean data
      • 2.2.1  Categorical data
      • 2.2.2  Boolean data
    • 2.3  Evaluation of dataset only with valid data (according to numeric variables)
      • 2.3.1  Numeric data
        • 2.3.1.1  Summary statistic
        • 2.3.1.2  Normal distribution evaluation
        • 2.3.1.3  Data distribution
        • 2.3.1.4  Outliers detection
      • 2.3.2  Categorical and boolean data
        • 2.3.2.1  Categorical data
        • 2.3.2.2  Boolean data
  • 3  Scientific usage of data agreement
    • 3.1  Numerical data
      • 3.1.1  Summary statistics
      • 3.1.2  Normal distribution evaluation
      • 3.1.3  Data distribution
      • 3.1.4  Outliers detection
    • 3.2  Categorical and boolean data
      • 3.2.1  Categorical data
      • 3.2.2  Boolean data
    • 3.3  Evaluation of subset only with valid data (from numeric data)
      • 3.3.1  Numerical data
        • 3.3.1.1  Summary statistics
        • 3.3.1.2  Normal distribution evaluation
        • 3.3.1.3  Data distribution
        • 3.3.1.4  Outliers detection
      • 3.3.2  Categorical and boolean data
        • 3.3.2.1  Categorical data
        • 3.3.2.2  Boolean data
  • 4  User_achievements table
    • 4.1  User_achievement table characteristics
    • 4.2  Connection to Users table
  • 5  User_programs table
    • 5.1  User_programs table characteristics
    • 5.2  Connection to Users table of completed programs
      • 5.2.1  Agreement to scientific data usage
        • 5.2.1.1  All of the data
        • 5.2.1.2  Completed programs

Introduction¶

This document presents a statistical analysis of the Users table. The table has 79 variables (columns) and 18688 records, and it contains demographic and partial usage data for all users. The analysis uses the following libraries:

  • pandas
  • numpy
  • statistics
  • matplotlib.pyplot
  • seaborn
  • pingouin
  • distfit

The column names are:

Index(['id', 'email', 'encrypted_password', 'reset_password_token',
       'reset_password_sent_at', 'remember_created_at', 'created_at',
       'updated_at', 'gender', 'date_of_birth', 'height', 'weight',
       'activity_level', 'goal', 'body_type', 'body_fat',
       'newsletter_subscription', 'is_admin', 'names', 'last_name',
       'sign_in_count', 'current_sign_in_at', 'last_sign_in_at',
       'current_sign_in_ip', 'last_sign_in_ip', 'recover_password_code',
       'recover_password_attempts', 'facebook_uid',
       'workout_setting_voice_coach', 'workout_setting_sound',
       'workout_setting_vibration', 'workout_setting_mobility',
       'workout_setting_cardio_warmup', 'workout_setting_countdown',
       'notifications_setting', 'training_days_setting', 'google_uid',
       'language', 'country', 'points', 'scientific_data_usage', 't1_push',
       't1_core', 't1_legs', 't1_full', 't1_push_exercise', 't1_pull_up',
       't2_reps', 't2_steps', 't2_reps_push', 't2_reps_core', 't2_reps_legs',
       't2_reps_full', 't2_time_push', 't2_time_core', 't2_time_legs',
       't2_time_full', 't1_full_exercise', 't1_pull_up_exercise',
       'warmup_setting', 'warmup_session_id', 'stripe_id', 'provider', 'uid',
       'best_weekly_streak', 'current_weekly_streak', 'affiliate_code',
       'affiliate_code_signup', 'total_sessions', 'total_time',
       'kcal_per_session', 'reps_per_session', 'moengage_id', 'mix_panel_id',
       'apple_id_token', 'imported', 'platform', 'login_token',
       'login_token_generated_at'],
      dtype='object')

Multiple columns are omitted from the analysis because they are sensitive or irrelevant.

The columns that will be analyzed are:

  • id - unique number for every user,
  • created_at,
  • updated_at,
  • gender - 0 – female, 1 – male,
  • date_of_birth,
  • height - in centimeters,
  • weight - in kilograms,
  • activity_level - 0 – very active, 1 – active, 2 - sedentary,
  • goal - 0 – lose, 1 – gain (the data also contain an antiaging goal, see the frequency tables),
  • body_type - 0 – thin, 1 – mid, 2 – strong,
  • body_fat - value in percent
  • newsletter_subscription - boolean value
  • notifications_setting - boolean value
  • training_days_setting - selected number of days a week to workout
  • language - Spanish/English
  • country - 2 letter country code (Alpha-2)
  • points - points collected by user
  • scientific_data_usage - boolean value, agreement for usage of data
  • best_weekly_streak - number of weeks when user accomplished "weekly goal" (all of the set training days)
  • affiliate_code_signup - code name if someone signed up by affiliate link
  • total_sessions - number of total sessions completed (? to be updated)
  • total_time - number of total spent time during sessions in seconds,
  • kcal_per_session - number of burned calories in one session,
  • reps_per_session - average number of reps per session,
  • BMI - column with body mass index number, added for analysis,
  • BMI_category - category of BMI (BMI<18.5 - underweight, 18.5 $\leq$ BMI $<$25 - normal weight, 25 $\leq$ BMI $<$30 - overweight, BMI $\geq$30 - Obesity), also added.
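The two derived columns can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's original cell: the sample values are hypothetical, and turning infinite BMI values (height of 0) into missing values is an assumption about how the 166 NULLs in BMI arise.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real Users table is not reproduced here.
users = pd.DataFrame({
    "height": [171.0, 164.0, 0.0, 185.0],   # centimeters
    "weight": [72.0, 45.0, 80.0, 105.0],    # kilograms
})

# Height in meters, then BMI = weight / height^2.
users["height[m]"] = users["height"] / 100
users["BMI"] = users["weight"] / users["height[m]"] ** 2

# Assumption: a height of 0 gives an infinite BMI, which is treated as missing.
users["BMI"] = users["BMI"].replace([np.inf, -np.inf], np.nan)

# right=False makes the bins left-closed: [18.5, 25), [25, 30), [30, inf).
users["BMI_category"] = pd.cut(
    users["BMI"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obesity"],
    right=False,
)
print(users)
```

Left-closed bins match the category definition given above, where a BMI of exactly 25 counts as overweight and exactly 30 as obesity.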

Variable types had to be converted to suitable ones. In addition, in most categorical variables the numeric codes were replaced with string labels. A summary of NULL counts and data types is given below.

In total, there are 18688 observations.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18688 entries, 0 to 18687
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       18688 non-null  category      
 1   created_at               18688 non-null  datetime64[ns]
 2   updated_at               18688 non-null  datetime64[ns]
 3   gender                   18688 non-null  category      
 4   date_of_birth            18688 non-null  datetime64[ns]
 5   height                   18688 non-null  float64       
 6   weight                   18688 non-null  float64       
 7   activity_level           18688 non-null  category      
 8   goal                     18688 non-null  category      
 9   body_type                18688 non-null  category      
 10  body_fat                 18688 non-null  float64       
 11  newsletter_subscription  18688 non-null  bool          
 12  notifications_setting    18688 non-null  bool          
 13  training_days_setting    18688 non-null  bool          
 14  language                 18688 non-null  category      
 15  country                  6352 non-null   category      
 16  points                   18688 non-null  int64         
 17  scientific_data_usage    18688 non-null  bool          
 18  best_weekly_streak       18688 non-null  int64         
 19  affiliate_code_signup    867 non-null    category      
 20  total_sessions           3640 non-null   float64       
 21  total_time               3640 non-null   float64       
 22  kcal_per_session         3640 non-null   float64       
 23  reps_per_session         3640 non-null   float64       
 24  height[m]                18688 non-null  float64       
 25  BMI                      18522 non-null  float64       
 26  BMI_category             18522 non-null  category      
dtypes: bool(4), category(9), datetime64[ns](3), float64(9), int64(2)
memory usage: 2.9 MB
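A type conversion of this kind might look like the sketch below. The sample rows are hypothetical and only four of the columns are shown; the label mapping for gender follows the column description above.

```python
import pandas as pd

# Hypothetical raw rows as they might come out of the database export.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "created_at": ["2021-01-05 10:00:00", "2021-02-11 08:30:00", "2021-03-20 19:45:00"],
    "gender": [0, 1, 1],
    "newsletter_subscription": [1, 0, 1],
})

users = raw.copy()
users["id"] = users["id"].astype("category")                # identifier, not a quantity
users["created_at"] = pd.to_datetime(users["created_at"])   # string -> datetime64[ns]
users["gender"] = (
    users["gender"].map({0: "female", 1: "male"}).astype("category")
)                                                           # numeric code -> string label
users["newsletter_subscription"] = users["newsletter_subscription"].astype(bool)

users.info()
```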

Main Users table analysis¶

The data was split into numerical and categorical/boolean data.

Numerical data¶

The variables taken as numerical data are:

  • height,
  • weight,
  • body_fat,
  • points,
  • best_weekly_streak,
  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI.

Summary statistics¶

The table below gives summary statistics (count, mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count).

                       count      mean        std    min      25%      50%       75%          max              var  skewness  kurtosis  NULL count
height              18688.00    169.67      23.09   0.00   164.00   171.00    178.00      1780.00           533.18     19.35   1475.26           0
weight              18688.00     73.16      15.84  22.00    62.00    72.00     82.00       277.00           250.79      1.33      7.19           0
body_fat            18688.00     24.28       8.60   2.00    20.00    25.00     30.00        80.00            73.93      0.66      0.40           0
points              18688.00  19478.15   93727.46   0.00     0.00   100.00   5047.00   2749450.00    8784837124.08     13.16    251.82           0
best_weekly_streak  18688.00      0.85       3.21   0.00     0.00     0.00      0.00        49.00            10.29      7.12     66.85           0
total_sessions       3640.00     18.79      35.61   1.00     2.00     5.00     19.00       922.00          1268.12      7.26    128.14       15048
total_time           3640.00  23281.25   45236.45   0.00  1539.50  5115.50  21869.00    622509.00    2046336681.33      3.94     23.48       15048
kcal_per_session     3640.00     48.99     144.33   0.00     5.15    24.08     68.00      4147.00         20830.65     19.22    461.92       15048
reps_per_session     3640.00  10355.27  575130.19   0.00    11.00    45.00    124.00  34597012.00  330774735472.08     59.82   3597.13       15048
BMI                 18522.00     24.92       4.47   0.27    22.05    24.22     26.87        87.62            19.97      1.54      7.54         166
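A table of this shape can be assembled from standard pandas calls. The sketch below uses a small hypothetical `num_table`; the real column set and values are the ones listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric table with one missing value.
num_table = pd.DataFrame({
    "height": [164.0, 171.0, 178.0, 1780.0, 0.0, 169.0],
    "total_sessions": [2.0, np.nan, 5.0, 19.0, 1.0, 922.0],
})

# describe() gives count, mean, std, min, quartiles and max per column.
summary = num_table.describe().T
summary["var"] = num_table.var()
summary["skewness"] = num_table.skew()
summary["kurtosis"] = num_table.kurt()          # pandas reports excess kurtosis
summary["NULL count"] = num_table.isna().sum()

print(summary.round(2))
```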

There are 18688 users in the data table, which means that 18688 people installed the application and signed up. Per variable:

  • height - median 171 cm (IQR 164-178), mean 169.67 cm (SD 23.09); the minimum of 0 cm and the maximum of 1780 cm are implausible and probably input errors,
  • weight - median 72 kg (IQR 62-82), mean 73.16 kg (SD 15.84), minimum 22 kg and maximum 277 kg,
  • body_fat - median 25% (IQR 20%-30%), mean 24.28% (SD 8.6), minimum 2% and maximum 80%,
  • points - median 100 (IQR 0-5047), mean 19478.15 (SD 93727.46), maximum 2749450,
  • best_weekly_streak - the maximum among all users is 49 weeks, median 0 (IQR 0-0), mean 0.85 (SD 3.21),
  • total_sessions - median 5 (IQR 2-19), mean 18.79 (SD 35.61), maximum 922 sessions,
  • total_time (in seconds according to the column description above; the unit still needs confirming) - median 5115.5 (IQR 1539.5-21869), mean 23281.25 (SD 45236.45), minimum 0 and maximum 622509,
  • kcal_per_session (average kilocalories burned per session, computed per user) - median 24.08 kcal (IQR 5.15-68), mean 48.99 kcal (SD 144.33),
  • reps_per_session (average number of reps per session, computed per user) - median 45 (IQR 11-124), mean 10355.27 (SD 575130.19), maximum 34597012; there are extreme outliers here,
  • BMI - median 24.22 (IQR 22.05-26.87), in the normal weight range; mean 24.92 (SD 4.47), just below the overweight threshold of 25; the minimum of 0.27 and the maximum of 87.62 are probably mistakes made by users (extreme outliers).

Normal distribution evaluation¶

The summary shows that there are many outliers in the data (some of them may be mistakes made while entering the data, i.e. human error).

To check whether the continuous data are normally distributed, histograms, Q-Q plots and the Shapiro-Wilk test were used. All of them are given below.

[Figure: Histogram plots for all numeric variables]
[Figure: QQ plots for all numeric variables]
        height  weight  body_fat  points  best_weekly_streak  total_sessions  total_time  kcal_per_session  reps_per_session    BMI
W         0.35    0.94      0.95    0.19                0.28            0.51        0.54              0.20              0.01   0.91
pval      0.00    0.00      0.00    0.00                0.00            0.00        0.00              0.00              0.00   0.00
normal   False   False     False   False               False           False       False             False             False  False

None of the continuous variables is normally distributed. Judging from the plots, the only variable that looks approximately normal is body fat, but the Shapiro-Wilk test rejects normality for it as well.
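A W / pval / normal table like the one above can be produced as in the following sketch. It uses `scipy.stats.shapiro` on synthetic data; the notebook may have used a different helper (for example from pingouin), and the 0.05 threshold is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-ins for two numeric columns (not the real data).
rng = np.random.default_rng(0)
num_table = pd.DataFrame({
    "normal_like": rng.normal(170, 9, size=500),
    "skewed": rng.exponential(20000, size=500),
})

rows = {}
for col in num_table.columns:
    values = num_table[col].dropna()        # the test does not accept NaN
    w, pval = stats.shapiro(values)
    rows[col] = {
        "W": round(float(w), 2),
        "pval": round(float(pval), 2),
        "normal": bool(pval > 0.05),        # assumed significance level
    }

result = pd.DataFrame(rows)                 # variables as columns, as in the table above
print(result)
```

With samples as large as the Users table, the test rejects normality even for small deviations, so the plots remain useful alongside the p-values.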

Looking at skewness and kurtosis (from summary statistics), it is also seen that there is no normal distribution in the data.

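Reminder:

  • Kurtosis:

    • if kurtosis is equal to 3 (excess kurtosis 0), the tails match those of a normal distribution,
    • if kurtosis > 3, the distribution is leptokurtic: it has heavier tails and produces more outliers than the normal distribution,
    • if kurtosis < 3, the distribution is platykurtic: it has lighter tails than the normal distribution,
    • the summary tables contain negative kurtosis values, which is only possible for excess kurtosis (kurtosis minus 3), so for those tables the reference value is 0 rather than 3,
  • Skewness:

    • if skewness is equal to 0, the distribution is symmetric (as a normal distribution is),
    • if skewness > 0, the right tail is longer and most of the mass lies on the left,
    • if skewness < 0, the left tail is longer and most of the mass lies on the right.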
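The two kurtosis conventions and the sign of skewness can be checked on synthetic samples. This is only an illustration with generated data, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
normal = pd.Series(rng.normal(size=100_000))
right_skewed = pd.Series(rng.exponential(size=100_000))

# Fisher (excess) kurtosis is about 0 for normal data; Pearson kurtosis is about 3.
excess = stats.kurtosis(normal, fisher=True)
pearson = stats.kurtosis(normal, fisher=False)
print(f"normal sample: excess={excess:.3f}, pearson={pearson:.3f}")

# pandas .skew() and .kurt() follow the excess-kurtosis convention.
print(f"exponential sample: skew={right_skewed.skew():.2f}, "
      f"excess kurtosis={right_skewed.kurt():.2f}")
```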

Distribution detection¶

It is possible to check which theoretical distribution the data most plausibly come from (or are closest to). For this the distfit function from the distfit package is used, and every variable is checked separately. The criterion for the best fit is the RSS (residual sum of squares), which describes the deviation of the predicted values from the actual empirical values of the data. A small RSS indicates a tight fit of the model to the data. RSS is computed by

$$ RSS = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 $$

where $y_i$ is the i-th value of the variable to be predicted, $x_i$ is the i-th value of the explanatory variable, and $f(x_i)$ is the predicted value of $y_i$ (also termed $\hat{y_i}$). (Source: https://erdogant.github.io/distfit/pages/html/Parametric.html) For each variable the top $5$ best fits are listed and shown on a plot together with their RSS values.
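To make the criterion concrete, the sketch below computes an RSS between a histogram density and fitted probability density functions using scipy only. It is an approximation of the idea under stated assumptions (50 bins, three hand-picked candidates, synthetic data), not the internals of distfit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=170, scale=9, size=5000)   # synthetic "height"-like sample

# Empirical density: y_i are histogram heights, x_i the bin centers.
y, edges = np.histogram(data, bins=50, density=True)
x = (edges[:-1] + edges[1:]) / 2

def rss_for(dist):
    """Fit dist by maximum likelihood and return sum_i (y_i - f(x_i))^2."""
    params = dist.fit(data)
    y_hat = dist.pdf(x, *params)
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical candidate list; distfit tries many more distributions.
scores = {name: rss_for(getattr(stats, name)) for name in ("norm", "expon", "logistic")}
best = min(scores, key=scores.get)

for name, rss in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:10s} RSS = {rss:.3e}")
print("best fit:", best)
```

Because the densities are small numbers for variables with a wide range (such as points), the resulting RSS values are tiny in absolute terms; they are only comparable between candidates for the same variable.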

  • height - the best distribution is exponential normal distribution (RSS = 0.0000044),
[distfit] >fit..
[distfit] >transform..
[distfit] >[exponnorm] [0.16 sec] [RSS: 4.42371e-06] [loc=164.793 scale=18.300]
[distfit] >[t        ] [0.82 sec] [RSS: 1.6207e-05] [loc=168.602 scale=20.232]
[distfit] >[hypsecant] [0.05 sec] [RSS: 1.75498e-05] [loc=170.913 scale=7.824]
[distfit] >[betaprime] [0.47 sec] [RSS: 1.7312e-05] [loc=-704.847 scale=2062.608]
[distfit] >[logistic ] [0.01 sec] [RSS: 1.81518e-05] [loc=170.845 scale=6.554]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for height (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: exponnorm]
  • weight - the best distribution is again exponential normal (RSS = 0.000049),
[distfit] >fit..
[distfit] >transform..
[distfit] >[exponnorm  ] [0.12 sec] [RSS: 4.89187e-05] [loc=61.338 scale=10.128]
[distfit] >[gengamma   ] [0.85 sec] [RSS: 4.95912e-05] [loc=7.526 scale=0.225]
[distfit] >[t          ] [0.83 sec] [RSS: 5.2237e-05] [loc=72.151 scale=13.086]
[distfit] >[logistic   ] [0.00 sec] [RSS: 5.3337e-05] [loc=72.211 scale=8.539]
[distfit] >[tukeylambda] [14.5 sec] [RSS: 5.35242e-05] [loc=72.272 scale=8.754]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for weight (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: exponnorm]
  • body_fat - the best distribution is dgamma (RSS = 0.0404),
[distfit] >fit..
[distfit] >transform..
[distfit] >[dgamma     ] [0.05 sec] [RSS: 0.0403587] [loc=22.782 scale=3.615]
[distfit] >[dweibull   ] [0.08 sec] [RSS: 0.0414598] [loc=23.232 scale=7.570]
[distfit] >[genlogistic] [0.13 sec] [RSS: 0.0419564] [loc=5.779 scale=6.750]
[distfit] >[invweibull ] [0.73 sec] [RSS: 0.0419672] [loc=-586064097.315 scale=586064117.542]
[distfit] >[gumbel_r   ] [0.00 sec] [RSS: 0.0419726] [loc=20.249 scale=7.190]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for body_fat (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: dgamma]
  • points - the best distribution is half logistic (RSS = 0.00000000000058),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halflogistic   ] [0.16 sec] [RSS: 5.80151e-13] [loc=-0.000 scale=17664.683]
[distfit] >[genhalflogistic] [0.42 sec] [RSS: 9.03931e-13] [loc=-30.607 scale=18800.076]
[distfit] >[gompertz       ] [0.25 sec] [RSS: 1.37191e-11] [loc=-0.000 scale=29654204083238240.000]
[distfit] >[expon          ] [0.00 sec] [RSS: 1.89989e-11] [loc=0.000 scale=19478.150]
[distfit] >[pareto         ] [0.01 sec] [RSS: 1.89989e-11] [loc=-4398046511103.998 scale=4398046511103.998]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for points (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: halflogistic]
  • best_weekly_streak - the best fitted distribution is lomax (also called Pareto type II) (RSS = 0.0319),
[distfit] >fit..
[distfit] >transform..
[distfit] >[lomax   ] [0.12 sec] [RSS: 0.0319116] [loc=-0.000 scale=3.643]
[distfit] >[gompertz] [0.26 sec] [RSS: 0.0411965] [loc=-0.000 scale=5457980424427.981]
[distfit] >[expon   ] [0.00 sec] [RSS: 0.0431919] [loc=0.000 scale=0.851]
[distfit] >[pareto  ] [0.01 sec] [RSS: 0.0431919] [loc=-134217728.000 scale=134217728.000]
[distfit] >[genexpon] [1.58 sec] [RSS: 0.0431971] [loc=-0.000 scale=1.709]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for best_weekly_streak (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: lomax]

For

  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI

the distribution cannot be computed because of the NULL values. These variables are fitted in a later part of this document.

The exponentially modified Gaussian distribution (exponnorm) is the most frequent best fit (height and weight); the other best fits are dgamma (body_fat), half logistic (points) and lomax (best_weekly_streak).

Outliers detection¶

Outliers can be detected once a suitable definition of an outlier has been chosen and applied. This is not done in this analysis; the disabled notebook cell below only sketches one possible approach.

Example of outlier detection and deleting (taking the whole dataset into consideration)
Example for height:

1. Look at the distribution plot of the "height" feature
    sns.distplot(num_table['height'])

2. Look at the box plot of the "height" feature
    sns.boxplot(num_table['height'])

3. Calculate the 99% and 1% quantiles of height
    upper_limit = num_table['height'].quantile(0.99)
    lower_limit = num_table['height'].quantile(0.01)

4. Apply trimming
    new_num_table = num_table[(num_table['height'] <= upper_limit) & (num_table['height'] >= lower_limit)]

5. Compare the distribution and box plot after trimming
    sns.distplot(new_num_table['height'])
    sns.boxplot(new_num_table['height'])

Winsorization:

6. Apply capping (winsorization)
    num_table['height'] = np.where(num_table['height'] >= upper_limit,
            upper_limit,
            np.where(num_table['height'] <= lower_limit,
            lower_limit,
            num_table['height']))

7. Compare the distribution and box plot after capping
    sns.distplot(num_table['height'])
    sns.boxplot(num_table['height'])

Categorical and boolean data¶

Categorical data¶

The variables taken as categorical are:

  • gender,
  • activity_level,
  • goal,
  • body_type,
  • language,
  • country,
  • affiliate_code_signup,
  • BMI_category.
Frequency tables¶

The frequency tables below show counts, percentages and cumulative percentages for each categorical variable.

Variable / factors  Frequency  Percent  Cumulative Percent
Gender
  female              7771.00   41.58%              41.58%
  male               10917.00   58.42%              100.0%
  Total              18688.00   100.0%                   -
Activity_level
  very active         2168.00    11.6%               11.6%
  active              9728.00   52.05%              63.66%
  sedentary           6792.00   36.34%              100.0%
  Total              18688.00   100.0%                   -
Goal
  lose                8257.00   44.18%              44.18%
  gain                7838.00   41.94%              86.12%
  antiaging           2593.00   13.88%              100.0%
  Total              18688.00   100.0%                   -
Language
  en                  1245.00    6.66%               6.66%
  es                 17443.00   93.34%              100.0%
  Total              18688.00   100.0%                   -
Body_type
  thin                7653.00   40.95%              40.95%
  mid                 8791.00   47.04%              87.99%
  strong              2244.00   12.01%              100.0%
  Total              18688.00   100.0%                   -
BMI_category
  Normal             10528.00   56.84%              56.84%
  Obesity             2145.00   11.58%              68.42%
  Overweight          5336.00   28.81%              97.23%
  Underweight          513.00    2.77%              100.0%
  Total              18522.00   100.0%                   -

The frequency tables show no NULLs in these categorical variables except BMI_category, whose total is 18522 (166 users have no BMI). Women make up 42% and men 58% of the user population. Over half of the users set their activity level as active (52%); far fewer described themselves as sedentary (36%) or very active (12%). Similar numbers of users chose losing weight (44%) or gaining weight (42%) as their goal; the smallest group (14%) chose the antiaging goal. 93% of users chose Spanish and 7% chose English. Most users describe their body type as mid (47%), then thin (41%), and the smallest group is strong (12%). Almost 57% of users have a normal BMI, 29% are overweight, 12% are obese and 3% are underweight.
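A frequency table with percent and cumulative percent columns can be built from `value_counts`. The helper name `frequency_table` and the sample values below are hypothetical; the notebook's own implementation is not shown in the export.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
gender = pd.Series(
    ["female", "female", "male", "male", "male", None], dtype="category"
)

def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, percent and cumulative percent per category (NULLs excluded)."""
    counts = series.value_counts(sort=False)     # keeps the category order
    percent = counts / counts.sum() * 100
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })

table = frequency_table(gender)
print(table)
print("Total:", int(table["Frequency"].sum()))
```

Since `value_counts` drops missing values by default, the total of such a table is the non-NULL count, which is why the BMI_category and country totals are lower than 18688.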

            Total      ES      US      AR      MX      CL      DE      GB      FR      CO  ...      HR      LB      KG      DZ      ET      EU      RS      GG      HU      LU
Frequency    6352    4866     273     219     186     136      69      67      64      50  ...       1       1       1       1       1       1       1       1       1       1
Percent    100.0%  76.61%    4.3%   3.45%   2.93%   2.14%   1.09%   1.05%   1.01%   0.79%  ...   0.02%   0.02%   0.02%   0.02%   0.02%   0.02%   0.02%   0.02%   0.02%   0.02%

2 rows × 80 columns

Among the countries chosen by users, the most frequent is Spain (77%). After a large drop come the USA (4%) and Argentina (3%). This explains why so many users chose Spanish as the main language of the app. The country variable has many NAs: the frequency counts total 6352 out of 18688, which means that only 34% of users chose a country of residence.

Total endika mariapelazas fitness_revolucionario mammothhunters lifestyle_con_blanca keto_aove gloria_martinez cristinamanyer martina_ferrer_ ... nicotononpt pablo_kuhnert maria_mendoza_a Anavb87 lilifitme janetgzzl fullmusculo eat2winmedia anabel_freyes healthybyjane
Frequency 867 271 108 97 83 77 53 44 37 23 ... 1 1 1 1 1 1 1 1 1 1
Percent 100.0% 31.26% 12.46% 11.19% 9.57% 8.88% 6.11% 5.07% 4.27% 2.65% ... 0.12% 0.12% 0.12% 0.12% 0.12% 0.12% 0.12% 0.12% 0.12% 0.12%

2 rows × 28 columns

Only 867 users (5% of all users) signed up with an affiliate code. The most frequent codes were endika (31%), mariapelazas (12%) and fitness_revolucionario (11%).

Boolean data¶

The variables taken as boolean are:

  • newsletter_subscription,
  • notifications_setting,
  • training_days_setting,
  • scientific_data_usage.
Frequency tables¶

The frequency tables below show counts and percentages after converting the boolean variables to categorical values.

Variable / factors  Frequency  Percent
scientific_data_usage
  False              12830.00   68.65%
  True                5858.00   31.35%
  Total              18688.00   100.0%
newsletter_subscription
  False               5230.00   27.99%
  True               13458.00   72.01%
  Total              18688.00   100.0%
notifications_setting
  False                107.00    0.57%
  True               18581.00   99.43%
  Total              18688.00   100.0%
training_days_setting
  True               18688.00   100.0%
  Total              18688.00   100.0%

From the boolean data, only 31% of users agreed on scientific usage of their data. 72% of users agreed on newsletter subscription, 99% agreed on notifications setting and all of them chose to set training days setting.

Evaluation of dataset only with valid data (according to numeric variables)¶

Looking at the numeric data, there are only 3621 valid observations. Within this valid subset (according to the numeric variables) there are only 1098 valid country observations, 2308 valid current_sign_in_at and last_sign_in_at observations, and 115 valid affiliate_code_signup observations.

Below there is a barplot with the count of NULL values in every numeric variable, followed by information about every variable (type, non-NULL count).

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3621 entries, 1 to 18660
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       3621 non-null   category      
 1   created_at               3621 non-null   datetime64[ns]
 2   updated_at               3621 non-null   datetime64[ns]
 3   gender                   3621 non-null   category      
 4   date_of_birth            3621 non-null   datetime64[ns]
 5   height                   3621 non-null   float64       
 6   weight                   3621 non-null   float64       
 7   activity_level           3621 non-null   category      
 8   goal                     3621 non-null   category      
 9   body_type                3621 non-null   category      
 10  body_fat                 3621 non-null   float64       
 11  newsletter_subscription  3621 non-null   bool          
 12  sign_in_count            3621 non-null   int64         
 13  current_sign_in_at       2308 non-null   datetime64[ns]
 14  last_sign_in_at          2308 non-null   datetime64[ns]
 15  notifications_setting    3621 non-null   bool          
 16  training_days_setting    3621 non-null   bool          
 17  language                 3621 non-null   category      
 18  country                  1098 non-null   category      
 19  points                   3621 non-null   int64         
 20  scientific_data_usage    3621 non-null   bool          
 21  best_weekly_streak       3621 non-null   int64         
 22  current_weekly_streak    3621 non-null   int64         
 23  affiliate_code_signup    115 non-null    category      
 24  total_sessions           3621 non-null   float64       
 25  total_time               3621 non-null   float64       
 26  kcal_per_session         3621 non-null   float64       
 27  reps_per_session         3621 non-null   float64       
 28  height[m]                3621 non-null   float64       
 29  BMI                      3621 non-null   float64       
 30  BMI_category             3621 non-null   category      
dtypes: bool(4), category(9), datetime64[ns](5), float64(9), int64(4)
memory usage: 1.2 MB
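A subset like this one could be obtained by dropping every row that has a NULL in any numeric column. The sketch below assumes that this is the only filter applied, which the export does not confirm; the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; None/NaN mark missing values.
users = pd.DataFrame({
    "height": [171.0, 164.0, 178.0, 160.0],
    "total_sessions": [5.0, np.nan, 19.0, 2.0],
    "BMI": [24.6, 22.1, np.nan, 23.0],
    "country": ["ES", None, None, "US"],
})

numeric_cols = ["height", "total_sessions", "BMI"]

# NULL count per numeric variable (the quantity shown in the barplot).
null_counts = users[numeric_cols].isna().sum()

# Keep only rows where every numeric variable is present.
valid = users.dropna(subset=numeric_cols)

print(null_counts)
valid.info()
```

The original row labels are kept after `dropna`, which is consistent with the non-contiguous index (1 to 18660) reported for the 3621 remaining rows.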

Numeric data¶

The variables taken as numerical data are:

  • height,
  • weight,
  • body_fat,
  • points,
  • best_weekly_streak,
  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI.
Summary statistic¶

In the full table of numerical data, NULL values occur mainly in total_sessions, total_time, kcal_per_session and reps_per_session, which have only 3640 non-NULL observations (19.48% of all observations); BMI has a further 166 NULLs. The valid subset analysed here contains 3621 observations. Restricted to valid data only, the summary statistics differ considerably.

                      count      mean        std     min      25%      50%       75%          max              var  skewness  kurtosis  NULL count
height              3621.00    171.31       9.22  142.00   164.00   171.00    178.00       221.00            85.01     -0.00     -0.24           0
weight              3621.00     71.69      14.51   40.00    61.00    71.00     80.00       277.00           210.55      1.38     11.57           0
body_fat            3621.00     23.70       8.23    6.00    20.00    21.00     30.00        50.00            67.66      0.63      0.25           0
points              3621.00  56737.84  167897.30    0.00   300.00  2100.00  37673.00   2749450.00   28189503440.45      7.00     71.64           0
best_weekly_streak  3621.00      4.38       6.13    1.00     1.00     2.00      5.00        49.00            37.62      3.35     14.18           0
total_sessions      3621.00     18.85      35.69    1.00     2.00     5.00     19.00       922.00          1273.88      7.25    127.59           0
total_time          3621.00  23354.22   45339.17    0.00  1538.00  5104.00  21914.00    622509.00    2055640595.21      3.93     23.35           0
kcal_per_session    3621.00     48.92     144.61    0.00     5.09    24.08     68.00      4147.00         20913.17     19.21    460.72           0
reps_per_session    3621.00  10409.11  576637.05    0.00    11.00    45.00    125.00  34597012.00  332510290413.48     59.67   3578.36           0
BMI                 3621.00     24.32       3.99   11.88    21.78    23.68     26.03        87.62            15.90      2.26     20.31           0

Compared with the full table (values taken from the two summary tables):

  • height - median and IQR are the same (171 cm, IQR 164-178), but the mean rose from 169.67 cm to 171.31 cm and the SD decreased from 23.09 to 9.22. The maximum decreased from 1780 cm to 221 cm and the minimum increased from 0 cm to 142 cm,
  • weight - the mean decreased from 73.16 kg to 71.69 kg (SD from 15.84 to 14.51) and the median from 72 kg to 71 kg. The minimum increased from 22 kg to 40 kg,
  • body_fat - the mean decreased from 24.28% to 23.70% (SD from 8.60 to 8.23), the median decreased by 4 percentage points (from 25% to 21%) and the maximum by 30 percentage points (from 80% to 50%),
  • points - everything increased except the minimum and maximum, which stayed the same. The mean went from 19478.15 to 56737.84, the SD from 93727.46 to 167897.30, and the median from 100 (IQR 0-5047) to 2100 (IQR 300-37673),
  • best_weekly_streak - the maximum among all users stayed at 49 weeks, the median increased from 0 (IQR 0-0) to 2 (IQR 1-5) and the mean from 0.85 (SD 3.21) to 4.38 (SD 6.13),
  • total_sessions - the median did not change and is 5 (IQR 2-19), the mean increased from 18.79 (SD 35.61) to 18.85 (SD 35.69), and the maximum stayed at 922 sessions,
  • total_time (in seconds according to the column description; the unit still needs confirming) - the median decreased from 5115.5 (IQR 1539.5-21869) to 5104 (IQR 1538-21914), the mean increased from 23281.25 (SD 45236.45) to 23354.22 (SD 45339.17), and the minimum and maximum stayed at 0 and 622509 respectively,
  • kcal_per_session (average per user) - practically unchanged: median 24.08 kcal (IQR 5.15-68 before, 5.09-68 after), mean 48.99 kcal (SD 144.33) before and 48.92 kcal (SD 144.61) after,
  • reps_per_session (average per user) - the median stayed at 45 (IQR 11-124 before, 11-125 after), the maximum stayed at 34597012 and the mean increased from 10355.27 (SD 575130.19) to 10409.11 (SD 576637.05); there are extreme outliers here,
  • BMI - the median decreased from 24.22 (IQR 22.05-26.87) to 23.68 (IQR 21.78-26.03), both in the normal weight range; the minimum increased from 0.27 (probably a mistake made by a user) to 11.88; the mean decreased from 24.92 (SD 4.47) to 24.32 (SD 3.99); the maximum stayed at 87.62 (probably also a mistake made by a user, an extreme outlier).

Normal distribution evaluation¶

The normality of this subset of data is checked by the same method as previously.

[Figure: Histogram plots for all numeric variables without NULLs]
[Figure: QQ plots for all numeric variables without NULLs]
        height  weight  body_fat  points  best_weekly_streak  total_sessions  total_time  kcal_per_session  reps_per_session    BMI
W         0.99    0.95      0.94    0.36                0.59            0.51        0.54              0.19              0.01   0.89
pval      0.00    0.00      0.00    0.00                0.00            0.00        0.00              0.00              0.00   0.00
normal   False   False     False   False               False           False       False             False             False  False

As previously, there is no normality in data, even when the NULL data observations are omitted. Skewness and kurtosis are another proof of non-normality of data.

Data distribution¶

Again, the distribution is checked for every variable, with goodness of fit measured by RSS. For each variable the top $5$ best fits are listed and shown on a plot together with their RSS values.

  • height - the best distribution is loggamma (RSS = 0.00346),
[distfit] >fit..
[distfit] >transform..
[distfit] >[loggamma ] [0.08 sec] [RSS: 0.00346306] [loc=-1688.805 scale=274.301]
[distfit] >[chi      ] [0.09 sec] [RSS: 0.0222726] [loc=142.000 scale=2.558]
[distfit] >[johnsonsb] [0.40 sec] [RSS: 0.00347174] [loc=-7155.538 scale=10180.728]
[distfit] >[powernorm] [0.12 sec] [RSS: 0.00347359] [loc=170.358 scale=8.902]
[distfit] >[logistic ] [0.00 sec] [RSS: 0.00381464] [loc=171.365 scale=5.378]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for height (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: loggamma]
  • weight - the most fitting distribution is the Maxwell distribution (RSS = 0.0000546),
[distfit] >fit..
[distfit] >transform..
[distfit] >[f        ] [0.13 sec] [RSS: 6.39223e-05] [loc=-4.684 scale=74.088]
[distfit] >[lognorm  ] [0.13 sec] [RSS: 6.37886e-05] [loc=11.428 scale=58.634]
[distfit] >[maxwell  ] [0.01 sec] [RSS: 5.45636e-05] [loc=37.901 scale=21.233]
[distfit] >[betaprime] [0.16 sec] [RSS: 6.0882e-05] [loc=-0.851 scale=25.570]
[distfit] >[erlang   ] [0.05 sec] [RSS: 6.11575e-05] [loc=29.325 scale=4.816]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for weight (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: maxwell]
  • body_fat - the best distribution is dgamma (RSS = 0.1673),
[distfit] >fit..
[distfit] >transform..
[distfit] >[dgamma     ] [0.02 sec] [RSS: 0.167269] [loc=22.741 scale=3.269]
[distfit] >[dweibull   ] [0.03 sec] [RSS: 0.169236] [loc=22.851 scale=7.322]
[distfit] >[gumbel_r   ] [0.00 sec] [RSS: 0.171813] [loc=19.832 scale=6.896]
[distfit] >[invweibull ] [0.18 sec] [RSS: 0.171811] [loc=-625523837.233 scale=625523857.082]
[distfit] >[genlogistic] [0.05 sec] [RSS: 0.171884] [loc=6.090 scale=6.470]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS per fitted distribution for body_fat (x-axis: distribution name, y-axis: RSS, lower is better); title: Best fit: dgamma]
  • points - the most fitting distribution is Wald (RSS = 0.0000000000127),
[distfit] >fit..
[distfit] >transform..
[distfit] >[wald     ] [0.02 sec] [RSS: 1.26588e-11] [loc=-11170.683 scale=44342.275]
[distfit] >[exponnorm] [0.14 sec] [RSS: 2.02968e-11] [loc=3.789 scale=22.622]
[distfit] >[expon    ] [0.00 sec] [RSS: 2.04285e-11] [loc=0.000 scale=56737.837]
[distfit] >[genexpon ] [1.37 sec] [RSS: 2.04285e-11] [loc=-0.000 scale=110178.329]
[distfit] >[gilbrat  ] [0.04 sec] [RSS: 2.89726e-11] [loc=-3827.821 scale=13823.906]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: wald'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • best_weekly_streak - the best-fitting distribution is Pearson type III (pearson3, RSS = 0.00248),
[distfit] >fit..
[distfit] >transform..
[distfit] >[gengamma] [0.28 sec] [RSS: 0.00725802] [loc=1.000 scale=0.411]
[distfit] >[pearson3] [0.20 sec] [RSS: 0.00247779] [loc=2.608 scale=2.146]
[distfit] >[gilbrat ] [0.02 sec] [RSS: 0.0078055] [loc=0.555 scale=1.605]
[distfit] >[burr    ] [0.34 sec] [RSS: 0.23682] [loc=1.000 scale=0.000]
[distfit] >[alpha   ] [0.04 sec] [RSS: 0.0121778] [loc=0.671 scale=0.501]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: pearson3'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • total_sessions - the best-fitting distribution is Gibrat (gilbrat, RSS = 0.0000449),
[distfit] >fit..
[distfit] >transform..
[distfit] >[pearson3 ] [0.21 sec] [RSS: 4.58683e-05] [loc=10.929 scale=11.176]
[distfit] >[gilbrat  ] [0.02 sec] [RSS: 4.49006e-05] [loc=-0.776 scale=7.945]
[distfit] >[wald     ] [0.01 sec] [RSS: 4.75319e-05] [loc=-2.523 scale=16.480]
[distfit] >[exponnorm] [0.11 sec] [RSS: 0.000100346] [loc=0.985 scale=0.005]
[distfit] >[genexpon ] [1.41 sec] [RSS: 0.000100512] [loc=1.000 scale=27.758]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: gilbrat'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • total_time - the best-fitting distribution is half Cauchy (RSS = 0.0000000000135),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halfcauchy] [0.06 sec] [RSS: 1.35239e-11] [loc=-0.000 scale=5332.925]
[distfit] >[cauchy    ] [0.01 sec] [RSS: 4.21087e-11] [loc=3171.441 scale=3987.686]
[distfit] >[gilbrat   ] [0.03 sec] [RSS: 5.50217e-11] [loc=-1682.041 scale=9469.309]
[distfit] >[beta      ] [0.16 sec] [RSS: 1.30578e-10] [loc=-0.000 scale=112625113.232]
[distfit] >[wald      ] [0.02 sec] [RSS: 1.76875e-10] [loc=-3942.148 scale=20442.556]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: halfcauchy'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • kcal_per_session - the best-fitting distribution is half logistic (RSS = 0.000000159),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halflogistic] [0.04 sec] [RSS: 1.58525e-07] [loc=-0.000 scale=37.574]
[distfit] >[gumbel_r    ] [0.00 sec] [RSS: 5.16349e-07] [loc=24.265 scale=34.796]
[distfit] >[genlogistic ] [0.11 sec] [RSS: 5.41946e-07] [loc=-229.704 scale=34.830]
[distfit] >[t           ] [0.22 sec] [RSS: 8.07242e-07] [loc=31.008 scale=32.174]
[distfit] >[hypsecant   ] [0.01 sec] [RSS: 1.77855e-06] [loc=33.612 scale=34.775]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: halflogistic'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • reps_per_session - the best-fitting distribution is Maxwell (RSS = 0.0000546),
[distfit] >fit..
[distfit] >transform..
[distfit] >[f        ] [0.13 sec] [RSS: 6.39223e-05] [loc=-4.684 scale=74.088]
[distfit] >[lognorm  ] [0.13 sec] [RSS: 6.37886e-05] [loc=11.428 scale=58.634]
[distfit] >[maxwell  ] [0.01 sec] [RSS: 5.45636e-05] [loc=37.901 scale=21.233]
[distfit] >[betaprime] [0.15 sec] [RSS: 6.0882e-05] [loc=-0.851 scale=25.570]
[distfit] >[erlang   ] [0.05 sec] [RSS: 6.11575e-05] [loc=29.325 scale=4.816]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: maxwell'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • BMI - the best fitting distribution is Fisk (RSS = 0.00032).
[distfit] >fit..
[distfit] >transform..
[distfit] >[fisk     ] [0.23 sec] [RSS: 0.000320445] [loc=10.885 scale=12.877]
[distfit] >[exponnorm] [0.03 sec] [RSS: 0.000405039] [loc=21.083 scale=2.038]
[distfit] >[burr     ] [0.23 sec] [RSS: 0.000410933] [loc=-0.111 scale=21.703]
[distfit] >[mielke   ] [0.17 sec] [RSS: 0.000411179] [loc=-0.171 scale=21.757]
[distfit] >[johnsonsu] [0.28 sec] [RSS: 0.000503498] [loc=20.226 scale=4.310]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: fisk'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)

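According to the best-fit titles in the outputs above, Maxwell is the best fit twice (weight and reps_per_session). Every other variable has its own best-fitting distribution: loggamma, dgamma, Wald, Pearson type III, Gibrat, half Cauchy, half logistic and Fisk.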

Outliers detection¶

This section will be done later.

Categorical and boolean data¶

Categorical data¶

The variables taken as categorical are:

  • gender,
  • activity_level,
  • goal,
  • body_type,
  • language,
  • country,
  • affiliate_code_signup,
  • BMI_category.

Frequency table¶

The frequency tables below show the counts, percentages and cumulative percentages for each category.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 1411.00 38.97% 38.97%
male 2210.00 61.03% 100.0%
Total 3621.00 100.0% -
Activity_level
very active 348.00 9.61% 9.61%
active 2024.00 55.9% 65.51%
sedentary 1249.00 34.49% 100.0%
Total 3621.00 100.0% -
Goal
lose 1467.00 40.51% 40.51%
gain 1639.00 45.26% 85.78%
antiaging 515.00 14.22% 100.0%
Total 3621.00 100.0% -
Language
en 179.00 4.94% 4.94%
es 3442.00 95.06% 100.0%
Total 3621.00 100.0% -
Body_type
thin 1586.00 43.8% 43.8%
mid 1692.00 46.73% 90.53%
strong 343.00 9.47% 100.0%
Total 3621.00 100.0% -
BMI_category
Normal 2300.00 63.52% 63.52%
Obesity 295.00 8.15% 71.67%
Overweight 937.00 25.88% 97.54%
Underweight 89.00 2.46% 100.0%
Total 3621.00 100.0% -
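
A table with Frequency, Percent and Cumulative Percent columns like the one above could be produced roughly as follows. This is only a sketch on toy data: the helper name frequency_table and the sample values are assumptions, not the code used in the original analysis.

```python
import pandas as pd


def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, percentages and cumulative percentages for one categorical column."""
    # NULL values are dropped, so percentages refer to valid observations only.
    counts = series.value_counts(sort=False, dropna=True)
    percent = 100 * counts / counts.sum()
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })


# Toy data, not the real Users table.
gender = pd.Series(["female", "male", "male", "female", "male"], dtype="category")
table = frequency_table(gender)
print(table)
```

The cumulative percentage depends on the row order, so it is only meaningful when the categories are listed in a deliberate order.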

The subset is unbalanced by gender: 2210 men (61%) and 1411 women (39%). The largest activity_level group is active with 2024 observations (56% of the subset), followed by sedentary with 1249 (34%) and very active with 348 (10%). The whole dataset has the same order: 52%, 36% and 12% of all users, respectively.

For the goal variable, the largest group in this subset is gain with 1639 observations (45%), followed by lose with 1467 (41%) and antiaging with 515 (14%). In the whole dataset the largest group is lose with 8257 observations (44%), followed by gain with 7838 (42%) and antiaging with 2593 (14%).

About 95% of the users in this subset chose Spanish as their app language (3442 observations) and 5% chose English. The most frequently chosen body type is mid with 1692 observations (47%), followed by thin with 1586 (44%) and strong with 343 (9%). By BMI category, the largest group has normal weight - 2300 (64%), followed by overweight - 937 (26%), obesity - 295 (8%) and underweight - 89 (2%).

Total ES AR MX US CL FR CO DE CH ... JP KG LB LT LU MA ML MY NI JM
Frequency 1098 874 40 33 26 15 13 11 11 9 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 79.6% 3.64% 3.01% 2.37% 1.37% 1.18% 1.0% 1.0% 0.82% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 80 columns

Most of the users who shared their country were, as before, from Spain, but the second largest group is now from Argentina (previously it was the USA). The affiliate codes are summarized next.

Total endika fitness_revolucionario lifestyle_con_blanca mammothhunters mariapelazas cristinamanyer martina_ferrer_ keto_aove MyHixel ... Anavb87 janetgzzl MerakiFit gloriaalcalar gloria_martinez fullmusculo eat2winmedia dracaminodiaz anabel_freyes healthybyjane
Frequency 115 25 19 19 16 14 6 6 4 2 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 21.74% 16.52% 16.52% 13.91% 12.17% 5.22% 5.22% 3.48% 1.74% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 28 columns

As before, the most frequently used affiliate code is endika, but the second most used code is now fitness_revolucionario, whereas previously it was mariapelazas. The number of valid observations decreased from 867 to 115.

Boolean data¶

The variables taken as boolean are:

  • newsletter_subscription,
  • notifications_setting,
  • training_days_setting,
  • scientific_data_usage.

Frequency table¶

After converting the boolean variables to categorical values, they can be summarized with the frequency tables below.

Frequency Percent
Variable factors
scientific_data_usage
False 2228.00 61.53%
True 1393.00 38.47%
Total 3621.00 100.0%
newsletter_subscription
False 1085.00 29.96%
True 2536.00 70.04%
Total 3621.00 100.0%
notifications_setting
False 69.00 1.91%
True 3552.00 98.09%
Total 3621.00 100.0%
training_days_setting
True 3621.00 100.0%
Total 3621.00 100.0%

The number of users who agreed to scientific data usage decreased from 5858 to 1393. Agreement to scientific_data_usage now covers 38% of non-NULL observations. 2536 users (70%) signed up for newsletter_subscription, and 3552 (98%) turned on notifications_setting. All of the users turned on training_days_setting.

Scientific usage of data agreement¶

Restricting the data to users who agreed to the scientific usage of their data, a similar analysis can be prepared.

A summary of the NULL counts and data types is given below. In total, there are 5858 observations.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5858 entries, 5 to 18687
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       5858 non-null   category      
 1   created_at               5858 non-null   datetime64[ns]
 2   updated_at               5858 non-null   datetime64[ns]
 3   gender                   5858 non-null   category      
 4   date_of_birth            5858 non-null   datetime64[ns]
 5   height                   5858 non-null   float64       
 6   weight                   5858 non-null   float64       
 7   activity_level           5858 non-null   category      
 8   goal                     5858 non-null   category      
 9   body_type                5858 non-null   category      
 10  body_fat                 5858 non-null   float64       
 11  newsletter_subscription  5858 non-null   bool          
 12  notifications_setting    5858 non-null   bool          
 13  training_days_setting    5858 non-null   bool          
 14  language                 5858 non-null   category      
 15  country                  215 non-null    category      
 16  points                   5858 non-null   int64         
 17  scientific_data_usage    5858 non-null   category      
 18  best_weekly_streak       5858 non-null   int64         
 19  affiliate_code_signup    13 non-null     category      
 20  total_sessions           1395 non-null   float64       
 21  total_time               1395 non-null   float64       
 22  kcal_per_session         1395 non-null   float64       
 23  reps_per_session         1395 non-null   float64       
 24  height[m]                5858 non-null   float64       
 25  BMI                      5853 non-null   float64       
 26  BMI_category             5853 non-null   category      
dtypes: bool(3), category(10), datetime64[ns](3), float64(9), int64(2)
memory usage: 1.4 MB
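
The subset and the listing above could be obtained along these lines. This is a minimal sketch on a toy stand-in for the Users table: the DataFrame name users and the decision to treat a NULL agreement as "not agreed" are assumptions.

```python
import pandas as pd

# Toy stand-in for the Users table; the real table has 18688 rows.
users = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "scientific_data_usage": [True, False, True, None],
    "total_sessions": [5.0, None, None, 2.0],
})

# Keep only users who explicitly agreed; a NULL agreement is treated as "no" here.
agreed = users[users["scientific_data_usage"].eq(True)]

# Prints column dtypes and non-NULL counts, as in the listing above.
agreed.info()
```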

Numerical data¶

The variables taken into consideration as numerical data will be the same as before:

  • height,
  • weight,
  • body_fat,
  • points,
  • best_weekly_streak,
  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI.

Summary statistics¶

The table below gives the summary statistics (mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count). They may differ from the ones in the first table.

                      count      mean       std    min      25%      50%       75%         max            var  skewness  kurtosis  NULL count
height              5858.00    168.78     10.60   1.00   162.00   169.00    175.26      236.22         112.40     -3.03     48.42           0
weight              5858.00     72.06     16.63  38.55    60.00    70.00     81.00      277.00         276.72      1.29      6.29           0
body_fat            5858.00     24.93      8.62   6.60    20.00    25.00     30.00       50.00          74.24      0.60      0.08           0
points              5858.00   8239.90  67324.18   0.00     0.00     0.00      0.00  2463230.00  4532545127.76     16.21    397.37           0
best_weekly_streak  5858.00      0.94      3.45   0.00     0.00     0.00      0.00       49.00          11.88      7.51     72.43           0
total_sessions      1395.00     16.25     30.63   1.00     2.00     5.00     15.00      274.00         938.37      3.59     16.11        4463
total_time          1395.00  20087.24  43611.22   0.00  1165.00  3901.00  14918.50   336812.00  1901938878.89      3.82     17.07        4463
kcal_per_session    1395.00     52.88    159.27   0.00     7.01    31.25     69.00     4147.00       25366.99     18.33    405.04        4463
reps_per_session    1395.00   2001.32  71395.05   0.00    15.00    64.00    126.00  2666671.00  5097253462.07     37.35   1394.99        4463
BMI                 5853.00     25.13      4.92  10.75    21.84    24.28     27.59       87.62          24.17      1.45      6.77           5
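
A table with these columns can be assembled from pandas' built-in aggregations. The following is a sketch on toy data, and the helper name summary_statistics is an assumption. Note that pandas reports excess kurtosis, so a normal distribution has a kurtosis of 0.

```python
import pandas as pd


def summary_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """describe() extended with variance, skewness, kurtosis and NULL count."""
    stats = df.describe().T            # count, mean, std, min, quartiles, max
    stats["var"] = df.var()
    stats["skewness"] = df.skew()
    stats["kurtosis"] = df.kurt()      # excess kurtosis (normal distribution = 0)
    stats["NULL count"] = df.isna().sum()
    return stats.round(2)


# Toy data, not the real Users table.
toy = pd.DataFrame({
    "height": [160.0, 170.0, 180.0, 175.0, 165.0],
    "total_sessions": [1.0, 5.0, None, 20.0, 3.0],
})
stats = summary_statistics(toy)
print(stats)
```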

There are 5858 users in this subset of the users table. Compared to the whole users table, median height decreased from 171 cm (IQR 164 - 178) to 169 cm (IQR 162 - 175), mean height decreased from 169.67 cm (SD 23.09) to 168.78 cm (SD 10.6) and maximum height decreased from 1780 cm to 236 cm. Median weight decreased from 72 kg (IQR 62 - 82) to 70 kg (IQR 60 - 81), the minimum increased from 22 kg to 39 kg and the maximum stayed at 277 kg. Mean weight decreased from 73.16 kg (SD 15.84) to 72.06 kg (SD 16.63). Median body fat stayed at 25% (IQR 20% - 30%) and the mean in this subset is 24.93% (SD 8.62), while the minimum reported body fat increased from 2% to 6.6% and the maximum decreased from 80% to 50%.

The median of points decreased from 100 (IQR 0 - 5047) to 0 (IQR 0 - 0), the maximum decreased from 2749450 to 2463230 and the mean decreased from 19478.15 (SD 93727.46) to 8240 (SD 67324). The maximum best_weekly_streak stayed at 49 weeks, the median is also unchanged at 0 (IQR 0 - 0) and the mean increased from 0.85 (SD 3.21) to 0.94 (SD 3.45).

For total_sessions the median stayed the same but the IQR changed: 5 (IQR 2 - 19) in the whole table and 5 (IQR 2 - 15) in the subset. The mean decreased from 18.79 (SD 35.61) to 16.25 (SD 30.63) and the maximum decreased from 922 to 274 sessions. Median total_time (possibly in minutes; the unit is not confirmed) decreased from 5115.5 (IQR 1539.5 - 21869) to 3901 (IQR 1165 - 14919), the mean decreased from 23281.25 (SD 45236.45) to 20087 (SD 43611), the minimum stayed at 0 and the maximum decreased from 622509 to 336812. The median of the per-user average of kilocalories burned per session increased from 24.08 kcal (IQR 5.15 - 68) to 31.25 kcal (IQR 7 - 69) and the mean increased from 48.99 kcal (SD 144.33) to 52.88 kcal (SD 159). The median of the per-user average number of reps per session increased from 45 (IQR 11 - 124) to 64 (IQR 15 - 126), while the maximum decreased from 34597012 to 2666671 and the mean decreased from 10355.27 (SD 575130.19) to 2001.32 (SD 71395.05).

Median BMI is 24 (IQR 22 - 28), which is in the normal weight group. The minimum is 10.75, the mean is 25 (SD 5), which is in the overweight group, and the maximum is 88 (an extreme outlier, probably a user input mistake).

Four of the last five variables (total_sessions, total_time, kcal_per_session and reps_per_session) have 4463 NULL observations each, which is 76% of this subset, and BMI has 5. That leaves only 1393 observations that are valid for all five variables.
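
The NULL share per variable and the number of rows that are valid for all of them can be checked as sketched below. The data is a toy example and the helper names null_report and complete_cases are assumptions; only the column names come from the table above.

```python
import pandas as pd


def null_report(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """NULL count and NULL percentage for the selected columns."""
    nulls = df[cols].isna()
    return pd.DataFrame({
        "NULL count": nulls.sum(),
        "NULL percent": (100 * nulls.mean()).round(2),
    })


def complete_cases(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Rows with a valid value in every selected column."""
    return df.dropna(subset=cols)


# Toy data, not the real Users table.
toy = pd.DataFrame({
    "total_sessions": [5.0, None, None, 2.0, 7.0],
    "BMI": [22.0, 24.0, None, 30.0, 26.0],
})
report = null_report(toy, ["total_sessions", "BMI"])
valid = complete_cases(toy, ["total_sessions", "BMI"])
print(report)
print(len(valid), "rows are valid for all selected columns")
```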

Normal distribution evaluation¶

[Figure: Histogram plots for numeric variables of users table subset]
[Figure: QQ plots for numeric variables of users table subset]

        height  weight  body_fat  points  best_weekly_streak  total_sessions  total_time  kcal_per_session  reps_per_session    BMI
W         0.86    0.94      0.94    0.10                0.29            0.54        0.49              0.18              0.01   0.93
pval      0.00    0.00      0.00    0.00                0.00            0.00        0.00              0.00              0.00   0.00
normal   False   False     False   False               False           False       False             False             False  False

Based on the plots and the Shapiro-Wilk table above, together with the skewness and kurtosis values, none of the variables appears to be normally distributed.
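
A W / pval / normal table of this shape can be built with scipy.stats.shapiro. This is a sketch under assumptions: the 0.05 significance level and the helper name shapiro_table are not stated in the original, and the data is simulated. SciPy also warns that the p-value may be inaccurate for samples larger than 5000, which applies to the 5858-row columns here.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_table(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Shapiro-Wilk W, p-value and a normality flag per column (NULLs dropped)."""
    rows = {}
    for col in df.columns:
        w, p = stats.shapiro(df[col].dropna())
        # alpha = 0.05 is an assumed threshold, not taken from the analysis.
        rows[col] = {"W": round(float(w), 2), "pval": round(float(p), 2),
                     "normal": bool(p > alpha)}
    return pd.DataFrame(rows)


# Simulated data: one roughly normal column and one strongly skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "bell_shaped": rng.normal(170, 9, size=500),
    "skewed": rng.exponential(scale=30, size=500),
})
result = shapiro_table(toy)
print(result)
```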

Data distribution¶

The distribution is checked for every variable. Goodness of fit is measured by the residual sum of squares (RSS). For each variable the top $5$ best fits are shown on a plot together with their RSS values.
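
The idea of ranking candidate distributions by RSS can be sketched with SciPy alone. This is an illustration, not distfit's implementation: it assumes the RSS is taken between the normalized histogram and the fitted density at the bin centers, and the bin count, candidate list and helper names are arbitrary choices.

```python
import numpy as np
from scipy import stats


def rss_of_fit(data, dist, bins=50):
    """RSS between the normalized histogram and the fitted density."""
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    params = dist.fit(data)                  # maximum-likelihood fit
    pdf = dist.pdf(centers, *params)
    return float(np.sum((density - pdf) ** 2))


def rank_distributions(data, candidates, top=5):
    """Candidate names sorted by RSS, lowest (best) first."""
    results = [(name, rss_of_fit(data, getattr(stats, name))) for name in candidates]
    return sorted(results, key=lambda item: item[1])[:top]


# Simulated data, not the real height column.
rng = np.random.default_rng(42)
sample = rng.normal(loc=170, scale=9, size=2000)
ranking = rank_distributions(sample, ["norm", "expon", "uniform"])
for name, rss in ranking:
    print(f"{name:10s} RSS = {rss:.6g}")
```

Because the RSS depends on the scale of the density, values are comparable between distributions for one variable but not between variables with very different ranges.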

  • height - the best-fitting distribution is Alpha (RSS = 0.0000874),
[distfit] >fit..
[distfit] >transform..
[distfit] >[alpha        ] [0.11 sec] [RSS: 8.73989e-05] [loc=-410.812 scale=35696.799]
[distfit] >[pearson3     ] [0.13 sec] [RSS: 9.13647e-05] [loc=168.924 scale=9.449]
[distfit] >[chi          ] [0.14 sec] [RSS: 9.45143e-05] [loc=95.946 scale=13.413]
[distfit] >[vonmises_line] [0.66 sec] [RSS: 0.00010522] [loc=168.854 scale=53.439]
[distfit] >[exponnorm    ] [0.04 sec] [RSS: 0.000113741] [loc=168.779 scale=10.601]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: alpha'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • weight - the best distribution is Erlang (RSS = 0.0000516),
[distfit] >fit..
[distfit] >transform..
[distfit] >[erlang  ] [0.09 sec] [RSS: 5.16235e-05] [loc=31.492 scale=6.594]
[distfit] >[gamma   ] [0.04 sec] [RSS: 5.16236e-05] [loc=31.493 scale=6.594]
[distfit] >[chi2    ] [0.09 sec] [RSS: 5.16236e-05] [loc=31.492 scale=3.297]
[distfit] >[pearson3] [0.10 sec] [RSS: 5.16237e-05] [loc=72.055 scale=16.355]
[distfit] >[beta    ] [0.15 sec] [RSS: 5.19325e-05] [loc=31.742 scale=25056174.059]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: erlang'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • body_fat - the best fitting distribution is dgamma (RSS = 0.1867),
[distfit] >fit..
[distfit] >transform..
[distfit] >[dgamma     ] [0.02 sec] [RSS: 0.186749] [loc=23.155 scale=3.651]
[distfit] >[dweibull   ] [0.03 sec] [RSS: 0.188869] [loc=23.376 scale=7.645]
[distfit] >[triang     ] [0.25 sec] [RSS: 0.191619] [loc=6.572 scale=44.826]
[distfit] >[genlogistic] [0.05 sec] [RSS: 0.19202] [loc=7.855 scale=6.721]
[distfit] >[genextreme ] [0.31 sec] [RSS: 0.192084] [loc=21.238 scale=7.440]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: dgamma'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • points - the most fitting distribution is halflogistic (RSS = 0.00000000008397),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halflogistic] [0.07 sec] [RSS: 8.39718e-11] [loc=-0.000 scale=8032.906]
[distfit] >[gompertz    ] [0.17 sec] [RSS: 1.11383e-10] [loc=-0.000 scale=791154496762173184.000]
[distfit] >[truncnorm   ] [0.28 sec] [RSS: 1.20445e-10] [loc=-535.690 scale=67750.432]
[distfit] >[foldnorm    ] [0.13 sec] [RSS: 1.21592e-10] [loc=-0.000 scale=67820.844]
[distfit] >[halfnorm    ] [0.04 sec] [RSS: 1.21592e-10] [loc=-0.000 scale=67820.847]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: halflogistic'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)
  • best_weekly_streak - the best distribution is Gompertz distribution (RSS = 0.01592),
[distfit] >fit..
[distfit] >transform..
[distfit] >[gompertz] [0.18 sec] [RSS: 0.0159231] [loc=-0.000 scale=459.936]
[distfit] >[lomax   ] [0.06 sec] [RSS: 0.0189393] [loc=-0.000 scale=3.403]
[distfit] >[pearson3] [0.28 sec] [RSS: 0.0270229] [loc=0.570 scale=0.670]
[distfit] >[genexpon] [1.36 sec] [RSS: 0.0356292] [loc=-0.000 scale=2.023]
[distfit] >[expon   ] [0.00 sec] [RSS: 0.0356374] [loc=0.000 scale=0.943]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
(<Figure size 900x400 with 1 Axes>,
 <AxesSubplot:title={'center':'Best fit: gompertz'}, xlabel='Distribution name', ylabel='RSS (lower is better)'>)

For

  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI

the distribution cannot be computed because of the NULL values. It will be fitted in a later part of this document.

Each of the five fitted variables has a different best-fitting distribution: Alpha (height), Erlang (weight), dgamma (body_fat), half logistic (points) and Gompertz (best_weekly_streak).

Outliers detection¶

Categorical and boolean data¶

Categorical data¶

The variables taken as categorical are:

  • gender,
  • activity_level,
  • goal,
  • body_type,
  • language,
  • country,
  • affiliate_code_signup,
  • BMI_category.

Frequency tables¶

The frequency tables below show the counts, percentages and cumulative percentages for each category.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 2981.00 50.89% 50.89%
male 2877.00 49.11% 100.0%
Total 5858.00 100.0% -
Activity_level
very active 531.00 9.06% 9.06%
active 2907.00 49.62% 58.69%
sedentary 2420.00 41.31% 100.0%
Total 5858.00 100.0% -
Goal
lose 2947.00 50.31% 50.31%
gain 2217.00 37.85% 88.15%
antiaging 694.00 11.85% 100.0%
Total 5858.00 100.0% -
Language
en 31.00 0.53% 0.53%
es 5827.00 99.47% 100.0%
Total 5858.00 100.0% -
Body_type
thin 2776.00 47.39% 47.39%
mid 2226.00 38.0% 85.39%
strong 856.00 14.61% 100.0%
Total 5858.00 100.0% -
BMI_category
Normal 3106.00 53.07% 53.07%
Obesity 846.00 14.45% 67.52%
Overweight 1667.00 28.48% 96.0%
Underweight 234.00 4.0% 100.0%
Total 5853.00 100.0% -

The male and female groups are almost equal in size (female 51%, male 49%; in the whole dataset it was 42% female and 58% male). The largest activity level group is active with 2907 observations (50%), followed by sedentary with 2420 (41%) and very active with 531 (9%). The group order is the same in the whole dataset (52%, 36% and 12%, respectively). Most of the users had a goal to lose weight - 2947 (50%), then to gain weight - 2217 (38%), and the smallest group had the antiaging goal - 694 (12%). Again, the order is the same as in the whole dataset (44%, 42% and 14%, respectively). Most of the users chose Spanish as the app language - 5827 (99%); only 31 (1%) chose English. In this subset most of the users described their body type as thin - 2776 (47%), then mid - 2226 (38%) and strong - 856 (15%). In the whole dataset the largest group is mid (47%), then thin (41%) and strong (12%). 3106 users (53%) have a normal BMI, 1667 (28%) are overweight, 846 (14%) have obesity and 234 (4%) are underweight.
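
For reference, BMI and a BMI category can be derived from height and weight as sketched below. The column names follow the table listing, but the cut-offs (18.5, 25 and 30, the usual WHO thresholds) and the helper name add_bmi are assumptions, since the thresholds used for BMI_category are not stated in this part of the document.

```python
import pandas as pd


def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """Add height[m], BMI and BMI_category columns (height in cm, weight in kg)."""
    out = df.copy()
    out["height[m]"] = out["height"] / 100
    out["BMI"] = out["weight"] / out["height[m]"] ** 2
    # Assumed WHO cut-offs; intervals are closed on the left, e.g. [18.5, 25).
    out["BMI_category"] = pd.cut(
        out["BMI"],
        bins=[0, 18.5, 25, 30, float("inf")],
        right=False,
        labels=["Underweight", "Normal", "Overweight", "Obesity"],
    )
    return out


# Toy data, not the real Users table.
toy = pd.DataFrame({
    "height": [170.0, 160.0, 180.0, 175.0],
    "weight": [70.0, 80.0, 55.0, 80.0],
})
with_bmi = add_bmi(toy)
print(with_bmi[["BMI", "BMI_category"]].round(2))
```

A zero or missing height yields an infinite or NULL BMI, which falls outside the intended categories and should be checked separately.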

Below are the frequency table and barplot of the country variable.

Total ES AR MX CH US CA CL CO AU ... HR HU IN IT AE JP KG LB LT JM
Frequency 215 173 9 6 4 4 2 2 2 1 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 80.47% 4.19% 2.79% 1.86% 1.86% 0.93% 0.93% 0.93% 0.47% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 80 columns

In the scientific_data_usage agreement subset only 215 users gave their country. 173 (80%) of them chose Spain, 9 (4%) chose Argentina and 6 (3%) chose Mexico. In the whole dataset, 4866 (77%) chose Spain, 273 (4%) chose the USA and 219 (3%) chose Argentina.
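
The many zero-frequency columns in the country table are what pandas produces for a categorical column: value_counts keeps every category, including those that no longer occur in the subset. A small sketch with toy data (the category list and variable names are assumptions):

```python
import pandas as pd

# Toy categorical column; JM and MX are known categories that do not occur here.
country = pd.Series(pd.Categorical(
    ["ES", "ES", "AR", None, "ES"],
    categories=["AR", "ES", "JM", "MX"],
))

counts = country.value_counts()            # NULLs dropped, unused categories kept as 0
counts.index = counts.index.astype(str)    # plain labels, so a "Total" entry can be added

freq = pd.concat([pd.Series({"Total": counts.sum()}), counts])
wide = pd.DataFrame({
    "Frequency": freq,
    "Percent": (100 * freq / counts.sum()).round(2),
}).T
print(wide)
```

Calling cat.remove_unused_categories() on the column before counting would drop the empty columns if they are not wanted.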

The frequency table and barplot of affiliate_code_signup are given below.

Total endika fitness_revolucionario mammothhunters martina_ferrer_ cristinamanyer mariapelazas keto_aove pablo_kuhnert nicotononpt ... gloriaalcalar gloria_martinez fullmusculo eat2winmedia dracaminodiaz blanca andreajuan anabel_freyes MyHixel healthybyjane
Frequency 13 4 3 3 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 30.77% 23.08% 23.08% 7.69% 7.69% 7.69% 0.0% 0.0% 0.0% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 28 columns

In this subset, 13 users signed up with an affiliate code, and 6 different affiliate codes were used. The most frequent one is, as before, endika - 4 (31%).

Boolean data¶

The variables taken as boolean are:

  • newsletter_subscription,
  • notifications_setting,
  • training_days_setting,
  • scientific_data_usage.

Frequency tables¶

After converting the boolean variables to categorical values, they can be summarized with the frequency tables below.

Frequency Percent
Variable factors
scientific_data_usage
False 0.00 0.0%
True 5858.00 100.0%
Total 5858.00 100.0%
newsletter_subscription
False 1079.00 18.42%
True 4779.00 81.58%
Total 5858.00 100.0%
notifications_setting
False 24.00 0.41%
True 5834.00 99.59%
Total 5858.00 100.0%
training_days_setting
True 5858.00 100.0%
Total 5858.00 100.0%

In this subset, 4779 users (82%) signed up for newsletter_subscription and 5834 (99.6%) turned on notifications_setting.

Evaluation of subset only with valid data (from numeric data)¶

Looking at the numeric data, there are only 1393 valid observations. Within this valid data (according to the numeric variables) there are only 177 valid country observations, 609 valid current_last_sign_in and last_sign_in_at observations and 11 valid observations of affiliate_code_signup.

Below are a barplot and a listing of the column types and non-NULL counts.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1393 entries, 5 to 18660
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       1393 non-null   category      
 1   created_at               1393 non-null   datetime64[ns]
 2   updated_at               1393 non-null   datetime64[ns]
 3   gender                   1393 non-null   category      
 4   date_of_birth            1393 non-null   datetime64[ns]
 5   height                   1393 non-null   float64       
 6   weight                   1393 non-null   float64       
 7   activity_level           1393 non-null   category      
 8   goal                     1393 non-null   category      
 9   body_type                1393 non-null   category      
 10  body_fat                 1393 non-null   float64       
 11  newsletter_subscription  1393 non-null   bool          
 12  notifications_setting    1393 non-null   bool          
 13  training_days_setting    1393 non-null   bool          
 14  language                 1393 non-null   category      
 15  country                  177 non-null    category      
 16  points                   1393 non-null   int64         
 17  scientific_data_usage    1393 non-null   category      
 18  best_weekly_streak       1393 non-null   int64         
 19  affiliate_code_signup    11 non-null     category      
 20  total_sessions           1393 non-null   float64       
 21  total_time               1393 non-null   float64       
 22  kcal_per_session         1393 non-null   float64       
 23  reps_per_session         1393 non-null   float64       
 24  height[m]                1393 non-null   float64       
 25  BMI                      1393 non-null   float64       
 26  BMI_category             1393 non-null   category      
dtypes: bool(3), category(10), datetime64[ns](3), float64(9), int64(2)
memory usage: 849.2 KB

Numerical data¶

The variables taken into consideration as numerical data will be the same as before:

  • height,
  • weight,
  • body_fat,
  • points,
  • best_weekly_streak,
  • total_sessions,
  • total_time,
  • kcal_per_session,
  • reps_per_session,
  • BMI.

Summary statistics¶

In the numerical data of this subset, NULL values occur for total_sessions, total_time, kcal_per_session and reps_per_session, which have only 1395 valid observations each (23.81% of the subset), and for BMI (5 NULLs). Keeping only the rows that are valid for all numeric variables leaves 1393 observations; their summary statistics are given below.

                      count      mean        std     min      25%      50%       75%         max             var  skewness  kurtosis  NULL count
height              1393.00    170.24       9.27  142.00   163.00   170.00    177.00      202.00           86.01      0.07     -0.34           0
weight              1393.00     71.83      15.86   40.00    60.00    70.00     80.00      277.00          251.39      2.02     19.77           0
body_fat            1393.00     24.58       8.56    6.60    20.00    25.00     30.00       50.00           73.30      0.53      0.00           0
points              1393.00  32792.22  133700.91    0.00   200.00   500.00   2700.00  2463230.00  17875934333.43      8.11     99.33           0
best_weekly_streak  1393.00      3.96       6.16    1.00     1.00     2.00      4.00       49.00           37.99      3.95     18.85           0
total_sessions      1393.00     16.26      30.65    1.00     2.00     5.00     15.00      274.00          939.64      3.59     16.08           0
total_time          1393.00  20104.16   43640.24    0.00  1165.00  3879.00  14937.00   336812.00   1904470598.41      3.81     17.04           0
kcal_per_session    1393.00     52.92     159.38    0.00     7.00    31.25     69.00     4147.00        25402.32     18.32    404.47           0
reps_per_session    1393.00   2004.16   71446.28    0.00    15.00    64.00    126.00  2666671.00   5104571479.21     37.32   1392.99           0
BMI                 1393.00     24.66       4.46   13.78    21.87    23.95     26.67       87.62           19.91      2.77     28.95           0

There are 1393 users in this valid-data subset of the scientific_data_usage subset. Compared to the scientific_data_usage subset, median height increased from 169 cm (IQR 162 - 175) to 170 cm (IQR 163 - 177), mean height increased from 168.78 cm (SD 10.6) to 170.24 cm (SD 9.27) and maximum height decreased from 236 cm to 202 cm. Median weight stayed at 70 kg (IQR 60 - 80, previously 60 - 81), the minimum increased from 39 kg to 40 kg and the maximum stayed at 277 kg. Mean weight decreased from 72.06 kg (SD 16.63) to 71.83 kg (SD 15.86). Median body fat stayed at 25% (IQR 20% - 30%) and mean body fat decreased from 24.93% (SD 8.62) to 24.58% (SD 8.56), while the minimum and maximum stayed at 6.6% and 50%, respectively.

The median of points increased from 0 (IQR 0 - 0) to 500 (IQR 200 - 2700), the maximum stayed at 2463230 and the mean increased from 8240 (SD 67324) to 32792.22 (SD 133700.91). The maximum best_weekly_streak stayed at 49 weeks, the median increased from 0 (IQR 0 - 0) to 2 (IQR 1 - 4) and the mean increased from 0.94 (SD 3.45) to 3.96 (SD 6.16).

The usage variables are almost unchanged, because only two of their 1395 valid observations were dropped. For total_sessions the median and maximum stayed at 5 (IQR 2 - 15) and 274 sessions, and the mean is 16.26 (SD 30.65; previously 16.25, SD 30.63). For total_time (possibly in minutes; the unit is not confirmed) the median is 3879 (IQR 1165 - 14937; previously 3901, IQR 1165 - 14919) and the mean is 20104 (SD 43640; previously 20087, SD 43611), while the minimum and maximum stayed at 0 and 336812. The median of the per-user average of kilocalories burned per session stayed at 31.25 kcal (IQR 7 - 69) and the mean is 52.92 kcal (SD 159; previously 52.88 kcal). The median and maximum of the per-user average number of reps per session stayed at 64 (IQR 15 - 126) and 2666671, and the mean is 2004.16 (SD 71446.28; previously 2001.32, SD 71395.05).

Median BMI stayed at 24 (IQR 22 - 27), in the normal weight group. The minimum increased from 11 to 14, the mean decreased from 25.13 to 24.66 (SD 4.46), which moves it from the overweight range into the normal range, and the maximum stayed at 88 (an extreme outlier, probably a user input mistake).
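
Comparisons of median and IQR between two subsets, as in the paragraphs above, are easier to verify when the quantiles are computed side by side. A sketch on toy data; the helper names median_iqr and compare_subsets are assumptions.

```python
import pandas as pd


def median_iqr(df: pd.DataFrame) -> pd.DataFrame:
    """Median and quartiles per column (NULLs are skipped)."""
    q = df.quantile([0.25, 0.5, 0.75])
    return pd.DataFrame({"q1": q.loc[0.25], "median": q.loc[0.5], "q3": q.loc[0.75]})


def compare_subsets(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Quantiles of two subsets next to each other, one row per variable."""
    return pd.concat({"before": median_iqr(before), "after": median_iqr(after)}, axis=1)


# Toy data: dropping rows without sessions shifts points but not total_sessions.
before = pd.DataFrame({
    "points": [0, 0, 0, 200, 500, 2700],
    "total_sessions": [None, None, None, 2.0, 5.0, 15.0],
})
after = before.dropna()
comparison = compare_subsets(before, after)
print(comparison)
```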

Normal distribution evaluation¶

The normality of this subset of data is checked by the same method as previously.

[Figure: Histogram plots for all numeric variables without NULLs]
[Figure: QQ plots for all numeric variables without NULLs]

        height  weight  body_fat  points  best_weekly_streak  total_sessions  total_time  kcal_per_session  reps_per_session    BMI
W         0.99    0.91      0.95    0.26                0.52            0.54        0.49              0.18              0.01   0.87
pval      0.00    0.00      0.00    0.00                0.00            0.00        0.00              0.00              0.00   0.00
normal   False   False     False   False               False           False       False             False             False  False

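As before, none of the variables is normally distributed, even after the observations with NULL values are omitted. Height comes closest (W = 0.99, skewness 0.07, kurtosis -0.34), but the Shapiro-Wilk test still rejects normality. For the other variables the skewness and kurtosis values give further evidence of non-normality.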

Data distribution¶

The distribution is checked for every variable. Goodness of fit is measured by the residual sum of squares (RSS). For each variable the top $5$ best fits are shown on a plot together with their RSS values.

  • height - the best-fitting distribution is Beta (RSS = 0.00489),
[distfit] >fit..
[distfit] >transform..
[distfit] >[chi         ] [0.06 sec] [RSS: 0.00502087] [loc=95.889 scale=13.171]
[distfit] >[beta        ] [0.03 sec] [RSS: 0.0048946] [loc=131.393 scale=84.604]
[distfit] >[t           ] [0.18 sec] [RSS: 0.00502322] [loc=170.237 scale=9.270]
[distfit] >[powerlognorm] [0.37 sec] [RSS: 0.00503724] [loc=-1.809 scale=174.453]
[distfit] >[burr        ] [0.25 sec] [RSS: 0.00534082] [loc=-2.278 scale=169.614]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for height (lower is better); best fit: beta]
  • weight - the best distribution is Maxwell (RSS = 8.85e-05),
[distfit] >fit..
[distfit] >transform..
[distfit] >[maxwell ] [0.00 sec] [RSS: 8.84655e-05] [loc=36.067 scale=22.585]
[distfit] >[erlang  ] [0.07 sec] [RSS: 9.22636e-05] [loc=30.375 scale=5.630]
[distfit] >[pearson3] [0.05 sec] [RSS: 9.31583e-05] [loc=71.831 scale=15.368]
[distfit] >[gamma   ] [0.04 sec] [RSS: 9.31579e-05] [loc=31.061 scale=5.793]
[distfit] >[chi2    ] [0.02 sec] [RSS: 9.31577e-05] [loc=31.061 scale=2.897]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for weight (lower is better); best fit: maxwell]
  • body_fat - the closest distribution is double gamma (dgamma, RSS = 0.17295),
[distfit] >fit..
[distfit] >transform..
[distfit] >[dgamma    ] [0.02 sec] [RSS: 0.172946] [loc=23.122 scale=3.555]
[distfit] >[dweibull  ] [0.02 sec] [RSS: 0.174803] [loc=23.257 scale=7.674]
[distfit] >[triang    ] [0.16 sec] [RSS: 0.177868] [loc=6.436 scale=44.631]
[distfit] >[genextreme] [0.10 sec] [RSS: 0.178235] [loc=20.955 scale=7.518]
[distfit] >[lognorm   ] [0.07 sec] [RSS: 0.178248] [loc=-12.397 scale=36.007]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for body_fat (lower is better); best fit: dgamma]
  • points - the closest distribution is the exponentially modified Gaussian (exponnorm, RSS = 2.247e-11),
[distfit] >fit..
[distfit] >transform..
[distfit] >[exponnorm      ] [0.09 sec] [RSS: 2.24662e-11] [loc=-3.063 scale=17.928]
[distfit] >[genexpon       ] [1.35 sec] [RSS: 2.25394e-11] [loc=-0.000 scale=81245.812]
[distfit] >[expon          ] [0.00 sec] [RSS: 2.25394e-11] [loc=0.000 scale=32792.217]
[distfit] >[halflogistic   ] [0.03 sec] [RSS: 3.90605e-11] [loc=-0.000 scale=30638.370]
[distfit] >[genhalflogistic] [0.09 sec] [RSS: 3.8801e-11] [loc=-0.495 scale=30579.066]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for points (lower is better); best fit: exponnorm]
  • best_weekly_streak - the best distribution is Gibrat's distribution (gilbrat, RSS = 0.00471),
[distfit] >fit..
[distfit] >transform..
[distfit] >[gilbrat ] [0.02 sec] [RSS: 0.00470744] [loc=0.639 scale=1.244]
[distfit] >[beta    ] [0.10 sec] [RSS: 0.0739864] [loc=1.000 scale=694.957]
[distfit] >[pearson3] [0.13 sec] [RSS: 0.0100343] [loc=2.878 scale=2.142]
[distfit] >[cauchy  ] [0.00 sec] [RSS: 0.014674] [loc=1.234 scale=0.669]
[distfit] >[wald    ] [0.01 sec] [RSS: 0.0194036] [loc=0.312 scale=2.746]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for best_weekly_streak (lower is better); best fit: gilbrat]
  • total_sessions - the closest distribution is folded Cauchy (foldcauchy, RSS = 9.84e-05),
[distfit] >fit..
[distfit] >transform..
[distfit] >[foldcauchy] [0.04 sec] [RSS: 9.84436e-05] [loc=1.000 scale=3.184]
[distfit] >[halfcauchy] [0.02 sec] [RSS: 9.90314e-05] [loc=1.000 scale=3.196]
[distfit] >[cauchy    ] [0.00 sec] [RSS: 0.000152909] [loc=2.988 scale=2.849]
[distfit] >[alpha     ] [0.01 sec] [RSS: 0.000374838] [loc=0.001 scale=1.883]
[distfit] >[t         ] [0.08 sec] [RSS: 0.000391715] [loc=2.467 scale=2.082]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for total_sessions (lower is better); best fit: foldcauchy]
  • total_time - the best distribution is half Cauchy (halfcauchy, RSS = 3.62e-11),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halfcauchy ] [0.05 sec] [RSS: 3.61999e-11] [loc=-0.000 scale=3962.202]
[distfit] >[cauchy     ] [0.01 sec] [RSS: 1.08565e-10] [loc=2464.453 scale=2924.470]
[distfit] >[t          ] [0.16 sec] [RSS: 1.38392e-10] [loc=1925.511 scale=1991.077]
[distfit] >[tukeylambda] [1.63 sec] [RSS: 1.9457e-10] [loc=1884.669 scale=577.463]
[distfit] >[gilbrat    ] [0.02 sec] [RSS: 4.68186e-10] [loc=-1254.375 scale=7147.656]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for total_time (lower is better); best fit: halfcauchy]
  • kcal_per_session - the closest distribution is half logistic (halflogistic, RSS = 3.16e-07),
[distfit] >fit..
[distfit] >transform..
[distfit] >[halflogistic   ] [0.03 sec] [RSS: 3.15684e-07] [loc=-0.000 scale=39.855]
[distfit] >[genhalflogistic] [0.09 sec] [RSS: 3.17229e-07] [loc=-0.000 scale=39.864]
[distfit] >[gumbel_r       ] [0.00 sec] [RSS: 3.90096e-07] [loc=27.268 scale=36.244]
[distfit] >[genlogistic    ] [0.07 sec] [RSS: 4.08793e-07] [loc=-232.292 scale=36.275]
[distfit] >[dgamma         ] [0.06 sec] [RSS: 1.47624e-05] [loc=60.000 scale=65.421]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for kcal_per_session (lower is better); best fit: halflogistic]
  • reps_per_session - the best distribution is truncated normal (truncnorm, RSS = 1.077e-10),
[distfit] >fit..
[distfit] >transform..
[distfit] >[truncnorm] [0.26 sec] [RSS: 1.07664e-10] [loc=-597.181 scale=71349.997]
[distfit] >[foldnorm ] [0.05 sec] [RSS: 1.08756e-10] [loc=-0.000 scale=71432.140]
[distfit] >[halfnorm ] [0.02 sec] [RSS: 1.09915e-10] [loc=-0.000 scale=71874.572]
[distfit] >[rice     ] [0.04 sec] [RSS: 1.37732e-10] [loc=-71305.258 scale=72371.189]
[distfit] >[rayleigh ] [0.00 sec] [RSS: 1.37732e-10] [loc=-71305.259 scale=72371.189]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for reps_per_session (lower is better); best fit: truncnorm]
  • BMI - the best-fitting distribution is Fisk (RSS = 0.000277).
[distfit] >fit..
[distfit] >transform..
[distfit] >[fisk     ] [0.14 sec] [RSS: 0.000277334] [loc=11.018 scale=13.020]
[distfit] >[exponnorm] [0.02 sec] [RSS: 0.000311229] [loc=21.081 scale=2.296]
[distfit] >[burr     ] [0.14 sec] [RSS: 0.000319281] [loc=-0.107 scale=21.995]
[distfit] >[mielke   ] [0.10 sec] [RSS: 0.000319468] [loc=-0.190 scale=22.069]
[distfit] >[johnsonsu] [0.26 sec] [RSS: 0.000383062] [loc=19.490 scale=5.247]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot summary..
[Figure: RSS of the top 5 fitted distributions for BMI (lower is better); best fit: fisk]

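According to the fitting output, each variable has a different best-fitting distribution: beta (height), Maxwell (weight), double gamma (body_fat), exponentially modified Gaussian (points), Gibrat (best_weekly_streak), folded Cauchy (total_sessions), half Cauchy (total_time), half logistic (kcal_per_session), truncated normal (reps_per_session) and Fisk (BMI).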

Outliers detection¶

This section will be done later.

Categorical and boolean data¶

Categorical data¶

The variables taken as categorical are:

  • gender,
  • activity_level,
  • goal,
  • body_type,
  • language,
  • country,
  • affiliate_code_signup,
  • BMI_category.
Frequency tables¶

The frequency tables below show counts, percentages and cumulative percentages for each categorical variable.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 607.00 43.58% 43.58%
male 786.00 56.42% 100.0%
Total 1393.00 100.0% -
Activity_level
very active 127.00 9.12% 9.12%
active 723.00 51.9% 61.02%
sedentary 543.00 38.98% 100.0%
Total 1393.00 100.0% -
Goal
lose 628.00 45.08% 45.08%
gain 583.00 41.85% 86.93%
antiaging 182.00 13.07% 100.0%
Total 1393.00 100.0% -
Language
en 22.00 1.58% 1.58%
es 1371.00 98.42% 100.0%
Total 1393.00 100.0% -
Body_type
thin 646.00 46.37% 46.37%
mid 582.00 41.78% 88.16%
strong 165.00 11.84% 100.0%
Total 1393.00 100.0% -
BMI_category
Normal 809.00 58.08% 58.08%
Obesity 147.00 10.55% 68.63%
Overweight 395.00 28.36% 96.98%
Underweight 42.00 3.02% 100.0%
Total 1393.00 100.0% -

In this subset of the scientific_data_usage agreement table there are 607 (44%) women and 786 (56%) men. The largest activity level group is active with 723 (52%) observations, followed by sedentary with 543 (39%) and very active with 127 (9%). Most users in this subset had the goal of losing weight - 628 (45%), then gaining weight - 583 (42%), and the smallest group had the antiaging goal - 182 (13%). Most users chose Spanish as the app language - 1371 (98%); only 22 (2%) users chose English. Most users described their body type as thin - 646 (46%), then mid - 582 (42%) and strong - 165 (12%). The most frequent BMI category is normal - 809 (58%), then overweight - 395 (28%), obesity - 147 (11%) and underweight - 42 (3%).
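A frequency table with Frequency, Percent and Cumulative Percent columns can be built with pandas roughly as follows. This is a sketch, not the notebook's own code: the function name frequency_table and the toy gender values are assumptions.

```python
import pandas as pd

def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, percentages and cumulative percentages of one categorical column."""
    counts = series.value_counts(dropna=True)  # NULLs excluded, sorted by count
    percent = counts / counts.sum() * 100
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": percent.round(2),
        "Cumulative Percent": percent.cumsum().round(2),
    })

# Toy data, invented for illustration only.
gender = pd.Series(["male", "female", "male", "male", None])
freq = frequency_table(gender)
print(freq)
```

Here the rows are ordered by descending count, while the tables in this document follow a fixed category order, so reindexing the result by the desired order would be needed to reproduce them exactly.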

Below is the frequency table and barplot of variable country.

Total ES AR MX CH CO CA CL AU DO ... HR HU IN IT AE JP KG LB LT JM
Frequency 177 145 7 5 3 2 2 2 1 1 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 81.92% 3.95% 2.82% 1.69% 1.13% 1.13% 1.13% 0.56% 0.56% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 80 columns

In the non-null subset of users who agreed to scientific data usage, only 177 users gave their country. Of them, 145 (82%) chose Spain, 7 (4%) chose Argentina and 5 (3%) chose Mexico.

The frequency table and barplot of affiliate_code_signup is located below.

Total fitness_revolucionario mammothhunters endika martina_ferrer_ cristinamanyer mariapelazas keto_aove pablo_kuhnert nicotononpt ... gloriaalcalar gloria_martinez fullmusculo eat2winmedia dracaminodiaz blanca andreajuan anabel_freyes MyHixel healthybyjane
Frequency 11 3 3 2 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Percent 100.0% 27.27% 27.27% 18.18% 9.09% 9.09% 9.09% 0.0% 0.0% 0.0% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%

2 rows × 28 columns

In this data subset, 11 users signed up with an affiliate code, and 6 different codes were used. The most frequent are fitness_revolucionario and mammothhunters, with 3 users (27%) each.

Boolean data¶

The variables taken as boolean are:

  • newsletter_subscription,
  • notifications_setting,
  • training_days_setting,
  • scientific_data_usage.
Frequency tables¶
Frequency Percent
Variable factors
scientific_data_usage
False 0.00 0.0%
True 1393.00 100.0%
Total 1393.00 100.0%
newsletter_subscription
False 292.00 20.96%
True 1101.00 79.04%
Total 1393.00 100.0%
notifications_setting
False 16.00 1.15%
True 1377.00 98.85%
Total 1393.00 100.0%
training_days_setting
True 1393.00 100.0%
Total 1393.00 100.0%

In this subset, 1101 (79%) of users subscribed to the newsletter (newsletter_subscription) and 1377 (99%) turned on notifications (notifications_setting).

User_achievements table¶

User_achievement table characteristics¶

Table user_achievements contains 31765 observations where there can be multiple observations for each user. The data frame contains:

  • ID - unique number of 'operation',
  • User_id - ID of user,
  • Achievement_id - ID of achievement,
  • created_at - when the achievement was created,
  • updated_at - when it was last updated (it can be the same date and time as created_at).

Below is information about data types and non-null values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31765 entries, 0 to 31764
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             31765 non-null  int64         
 1   user_id        31765 non-null  int64         
 2   achievment_id  31765 non-null  int64         
 3   created_at     31765 non-null  datetime64[ns]
 4   updated_at     31765 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(3)
memory usage: 1.2 MB

In the analysis, only the last achievement of each user is taken into consideration. This leaves 3625 observations ($11.4\%$ of the whole table), and the values of user_id are now unique. Below are the data types and non-null counts for this subset.
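Keeping only the last achievement per user can be sketched as below. It assumes that "last" means the row with the latest created_at for each user_id, which the text does not state explicitly; the toy rows are invented.

```python
import pandas as pd

# Toy stand-in for user_achievements (values invented for illustration).
achievements = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user_id": [10, 10, 20, 10, 20],
    "achievment_id": [3, 4, 3, 5, 4],  # column name spelled as in the table
    "created_at": pd.to_datetime([
        "2021-01-01", "2021-02-01", "2021-01-15", "2021-03-01", "2021-02-20",
    ]),
})

# Assumption: the "last" achievement is the most recently created one.
last = (
    achievements
    .sort_values("created_at")
    .drop_duplicates(subset="user_id", keep="last")
)
share = len(last) / len(achievements) * 100
print(last)
print(f"{share:.1f}% of rows kept")
```

If "last" were instead defined by the highest achievment_id or by updated_at, only the sort key would change.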

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3625 entries, 4 to 31764
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             3625 non-null   int64         
 1   user_id        3625 non-null   int64         
 2   achievment_id  3625 non-null   int64         
 3   created_at     3625 non-null   datetime64[ns]
 4   updated_at     3625 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(3)
memory usage: 169.9 KB

Treating achievement_id as numerical value, the summary statistics (mean, standard deviation, minimum, maximum, quartiles, variance, skewness, kurtosis and NULL count) are given below.

count mean std min 25% 50% 75% max var skewness kurtosis NULL count
achievment_id 3625.00 10.74 8.95 3.00 3.00 6.00 17.00 34.00 80.14 0.85 -0.66 0

The mean achievement_id is 11 (Gorilla), the minimum is 3 (Caterpillar, the starting achievement), the maximum is 34 (Just a regular folk) and the median is 6 (Chipmunk). There is no NULL data.

Alternatively, achievement_id can be treated as a categorical variable. Most users then have achievement 3 (Caterpillar) - 1154 (32%) users, followed by 4 (Snail) - 414 (11%) users and 5 (Turtle) - 235 (6%) users. The least frequent achievement is 34 (Just a regular folk) - 2 (0.06%) users. Below is a frequency table sorted in descending order and a barplot with the number of occurrences of each achievement_id.

Frequency Percent
Total 3625 100.0%
3 1154 31.83%
4 414 11.42%
5 235 6.48%
24 156 4.3%
23 128 3.53%
13 119 3.28%
6 117 3.23%
17 103 2.84%
14 102 2.81%
22 88 2.43%
25 78 2.15%
9 76 2.1%
7 76 2.1%
16 70 1.93%
21 68 1.88%
15 67 1.85%
10 57 1.57%
8 57 1.57%
26 51 1.41%
19 51 1.41%
11 51 1.41%
29 46 1.27%
33 46 1.27%
20 45 1.24%
32 40 1.1%
12 34 0.94%
28 34 0.94%
27 29 0.8%
18 18 0.5%
31 13 0.36%
34 2 0.06%

Connection to Users table¶

The user_achievements table contains user_id, so it can be merged with the users table: every id from the users table is kept, together with the matching user_id rows from user_achievements.

Only 5 columns are kept from the merged table:

  • id_x - ID from user_achievements table,
  • user_id - ID of users,
  • achievement_id - ID of achievement,
  • id_y - ID from users table,
  • points - points scored by the user.

Below there is a glimpse of this table.

id_x user_id achievment_id id_y points
0 NaN NaN NaN 1880 25884
1 247.00 747.00 3.00 747 100
2 NaN NaN NaN 3469 580
3 NaN NaN NaN 1876 0
4 NaN NaN NaN 1886 11014
5 29073.00 1264.00 4.00 1264 650
6 NaN NaN NaN 1875 0
7 NaN NaN NaN 1877 0
8 8306.00 8228.00 3.00 8228 350
9 NaN NaN NaN 1874 65338
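A merge that keeps every user and leaves NaN where no achievement matches can be sketched as follows. The join direction and the variable names are assumptions; the three example rows reuse values from the glimpse above.

```python
import pandas as pd

# Minimal stand-ins for the two tables, using values from the glimpse above.
users = pd.DataFrame({"id": [747, 1880, 1264], "points": [100, 25884, 650]})
last_achievements = pd.DataFrame({
    "id": [247, 29073],
    "user_id": [747, 1264],
    "achievment_id": [3, 4],
})

# Keep all users (right side); achievements are attached where user_id == id.
# Both tables have an "id" column, so pandas adds the _x / _y suffixes.
merged = last_achievements.merge(
    users, how="right", left_on="user_id", right_on="id"
)[["id_x", "user_id", "achievment_id", "id_y", "points"]]
print(merged)
```

Users without an achievement get NaN in id_x, user_id and achievment_id, which is also why these columns are displayed as floats.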

Some users who have points do not have an achievement. The counts below cover all four possible situations.

Have 0 points and achievement assigned: 0 
 Have 0 points and no achievement assigned: 9183 
 Have points and no achievement assigned: 5880 
 Have points and achievement assigned: 3625

The four counts sum to 18688, the number of observations in the users table. If the analysis used only the last group (points and an achievement assigned), only 3625 observations would remain. Every user should have some achievement assigned from the start, so it would be best to assign an achievement to every user according to their points, using the achievements table. That would give the largest number of observations to analyze.
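The four situations can be counted with two boolean masks. The sketch below uses a small invented table; the column names follow the merged table, and "has points" is assumed to mean points > 0.

```python
import pandas as pd

# Toy merged table (invented values); NaN means no achievement was matched.
merged = pd.DataFrame({
    "achievment_id": [3.0, None, None, 4.0, None],
    "points": [100, 0, 580, 650, 0],
})

has_points = merged["points"] > 0          # assumption: "has points" means > 0
has_achievement = merged["achievment_id"].notna()

counts = {
    "0 points, achievement": int((~has_points & has_achievement).sum()),
    "0 points, no achievement": int((~has_points & ~has_achievement).sum()),
    "points, no achievement": int((has_points & ~has_achievement).sum()),
    "points, achievement": int((has_points & has_achievement).sum()),
}
for label, n in counts.items():
    print(f"{label}: {n}")

total = sum(counts.values())
```

Because the four masks are mutually exclusive and cover every row, their sum always equals the number of rows, which is the check made in the text against 18688.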

User_programs table¶

User_programs table characteristics¶

Table user_programs contains 81321 observations where there can be multiple observations for each user. The data frame contains:

  • ID - unique number of 'operation',
  • User_id - ID of user,
  • program_id - ID of program,
  • created_at - when the program was probably started,
  • updated_at - when the program was probably finished/updated,
  • active - whether this is the program the user is currently doing,
  • current_session_id - id of the session when the program was executed,
  • completed - whether the program is completed,
  • enjoyment - enjoyment rating in points (the scale is not confirmed), only available when completed is true (when the user completed the program).

Below is information about data types and non-null values. User_id and program_id are treated as categories. The table after it counts how many programs were completed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81321 entries, 0 to 81320
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  81321 non-null  int64         
 1   user_id             81321 non-null  category      
 2   program_id          81321 non-null  category      
 3   created_at          81321 non-null  datetime64[ns]
 4   updated_at          81321 non-null  datetime64[ns]
 5   active              81321 non-null  bool          
 6   current_session_id  81305 non-null  float64       
 7   completed           81321 non-null  bool          
 8   enjoyment           1599 non-null   float64       
 9   enjoyment_notes     164 non-null    object        
dtypes: bool(2), category(2), datetime64[ns](2), float64(2), int64(1), object(1)
memory usage: 4.8+ MB
Frequency Percent
False 72227 88.82%
True 9094 11.18%
Total 81321 100.0%

The tables above show that programs were completed only 9094 times (11% of all started programs). Of these 9094 completions, only 1599 (18%) have an enjoyment rating and only 164 (2%) have written feedback.
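The percentages above follow directly from the counts in the output. A short check, assuming (as the column description states) that enjoyment values exist only for completed programs, and assuming the same holds for written feedback:

```python
started = 81321         # rows in user_programs
completed = 9094        # rows with completed == True
with_enjoyment = 1599   # non-null enjoyment
with_notes = 164        # non-null enjoyment_notes

def pct(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(part / whole * 100, 2)

completion_rate = pct(completed, started)          # completed among started
enjoyment_rate = pct(with_enjoyment, completed)    # rated among completed
notes_rate = pct(with_notes, completed)            # written feedback among completed
print(completion_rate, enjoyment_rate, notes_rate)
```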

Below there is a table with currently active programs.

Frequency Percent
False 66181 81.38%
True 15140 18.62%
Total 81321 100.0%

There are 15140 (19%) currently active programs. The table and barplot below show the ten users who started the most programs. The top five are id 360 (396 programs started), 708 (54), 989 (41), 1561 (38) and 7094 (38).

user_id 360 708 989 1561 7094 875 8055 6271 7948 5093
programs_started 396 54 41 38 38 37 37 36 36 36

It is possible to make a subset of only completed programs. In it, the most programs were completed by the user with id 3169 (18 completed programs). The second largest number, 16, belongs to the user with id 2390.

user_id 3169 2390 2677 1718 2526 1799 7761 13552 1855 2013 1860 1285 3111 3216 1857 1350 2648 8165 1333 3214
programs_completed 18 16 13 11 10 10 9 9 8 8 8 8 8 8 7 7 7 7 7 7
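Rankings like the two above can be obtained by grouping on user_id and counting rows. The sketch below uses an invented miniature of user_programs; only the column names come from the document.

```python
import pandas as pd

# Toy stand-in for user_programs (invented rows).
user_programs = pd.DataFrame({
    "user_id": [360, 360, 360, 708, 708, 989],
    "program_id": [5, 36, 29, 5, 504, 10],
    "completed": [False, True, False, True, True, False],
})

# Programs started per user: one row per started program.
started_per_user = user_programs.groupby("user_id").size().nlargest(10)

# Programs completed per user: filter first, then count.
completed_per_user = (
    user_programs[user_programs["completed"]]
    .groupby("user_id")
    .size()
    .nlargest(10)
)
print(started_per_user)
print(completed_per_user)
```

When user_id is stored as a category, as in the info output above, passing observed=True to groupby avoids listing users with zero rows in the filtered subset.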

The table and barplot below show the most frequently started programs. The programs that users started most often have id 5, 36, 29, 504, 10, 30, 34, 6, 12, 38 and 7.

program_id 5 36 29 504 10 30 34 6 12 38 7 428 23 39 13
times_started 20984 20600 9131 7286 2207 1912 1600 1571 913 876 827 817 801 794 789

The table and barplot below show the top 20 most frequently completed programs. The most frequently completed program is number 504, with 7286 completions. The second is program number 6, with 197 completions.

program_id 504 6 10 29 7 12 428 23 13 16 14 8 34 30 25 503 22 9 500 26
times_completed 7286 197 194 130 117 109 98 94 84 68 66 62 59 44 42 41 39 39 36 35

Connection to Users table of completed programs¶

The users (id) and user_programs (user_id) tables can be merged to obtain characteristics of specific groups and programs. This makes it possible to compare gender, activity level, goal, body type, notification settings, language, BMI category and number of programs completed.

First, the merged table is filtered to completed programs. The count of completed programs and the number of points for each user are presented below.

count points BMI notification_settings scientific_data_usage
user_id
3169 18 370089 24.24 True False
2390 16 838205 25.77 True False
2677 13 583894 30.67 True False
1718 11 1810142 24.52 True False
2526 10 308822 19.23 True False

The largest number of completed programs is 18, for user 3169, whose BMI (24.24) is in the normal range. Interestingly, this user does not have the most points. The most points, 2749450, belong to user 1442, who is overweight (BMI = 25.99) and completed only 2 programs (the head of the table sorted by points is shown below). Both users had notifications turned on, but neither agreed to scientific data usage.

count points BMI notification_settings scientific_data_usage
user_id
1442 2 2749450 25.99 True False
2978 1 2741622 24.72 True False
3061 3 2482727 21.91 True False
889 2 2463230 28.40 True True
3007 1 2305978 17.36 True False
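The two per-user tables above can be produced by filtering to completed programs, merging with users and aggregating. In the sketch below the points and BMI values are taken from the tables above, while the program rows (and therefore the counts) are invented to keep the example short.

```python
import pandas as pd

users = pd.DataFrame({
    "id": [3169, 1442, 889],
    "points": [370089, 2749450, 2463230],
    "BMI": [24.24, 25.99, 28.40],
})
# Invented program rows; the real counts are 18, 2 and 2 completed programs.
user_programs = pd.DataFrame({
    "user_id": [3169, 3169, 3169, 1442, 1442, 889, 889],
    "completed": [True, True, True, True, False, True, True],
})

completed = user_programs[user_programs["completed"]]
per_user = (
    completed.merge(users, left_on="user_id", right_on="id")
    .groupby("user_id")
    .agg(
        count=("completed", "size"),   # number of completed programs
        points=("points", "first"),    # user-level value, identical on every row
        BMI=("BMI", "first"),
    )
)
by_count = per_user.sort_values("count", ascending=False)
by_points = per_user.sort_values("points", ascending=False)
print(by_count.head())
print(by_points.head())
```

Sorting the same summary by count or by points gives the two views compared in the text.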

Below there are frequency tables for completed programs.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 2539.00 27.92% 27.92%
male 6555.00 72.08% 100.0%
Total 9094.00 100.0% -
Activity_level
very active 1317.00 14.48% 14.48%
active 5239.00 57.61% 72.09%
sedentary 2538.00 27.91% 100.0%
Total 9094.00 100.0% -
Goal
lose 3217.00 35.37% 35.37%
gain 4398.00 48.36% 83.74%
antiaging 1479.00 16.26% 100.0%
Total 9094.00 100.0% -
Language
en 132.00 1.45% 1.45%
es 8962.00 98.55% 100.0%
Total 9094.00 100.0% -
Body_type
thin 3286.00 36.13% 36.13%
mid 4930.00 54.21% 90.35%
strong 878.00 9.65% 100.0%
Total 9094.00 100.0% -
BMI_category
Normal 5656.00 62.33% 62.33%
Obesity 658.00 7.25% 69.58%
Overweight 2602.00 28.67% 98.25%
Underweight 159.00 1.75% 100.0%
Total 9075.00 100.0% -

In this subset, most completed programs were completed by men - 6555 (72%), while women completed 2539 programs (28% of all completed programs). Users with activity level active completed 5239 programs (58%), sedentary users completed 2538 (28%) and very active users completed 1317 (14%). Users whose goal is to gain weight completed 4398 programs (48%), those who want to lose weight completed 3217 (35%) and those with the antiaging goal completed 1479 (16%). Users with the app set to English completed 132 programs (1%) and users with the app set to Spanish completed 8962 (99%). Users with the mid body type completed 4930 programs (54%), thin 3286 (36%) and strong 878 (10%). By BMI category, most programs were completed by users with normal weight - 5656 (62%), followed by overweight - 2602 (29%), obesity - 658 (7%) and underweight - 159 (2%).

Below there are frequency tables for boolean variables.

Frequency Percent
Variable factors
scientific_data_usage
False 8095.00 89.01%
True 999.00 10.99%
Total 9094.00 100.0%
newsletter_subscription
False 1896.00 20.85%
True 7198.00 79.15%
Total 9094.00 100.0%
notifications_setting
False 113.00 1.24%
True 8981.00 98.76%
Total 9094.00 100.0%

Users who agreed to scientific data usage account for only 999 (11%) of the completed programs. Users who subscribed to the newsletter completed 7198 programs (79%) and users with notifications turned on completed 8981 programs (99%).

Agreement to scientific data usage¶

All of the data¶

Below is the subset of data for users who agreed to scientific data usage. There are 20428 observations, and only 605 of them have an enjoyment value.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20428 entries, 0 to 81320
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   user_id                  20428 non-null  int64   
 1   program_id               20428 non-null  category
 2   active                   20428 non-null  category
 3   completed                20428 non-null  category
 4   enjoyment                605 non-null    float64 
 5   id                       20428 non-null  category
 6   gender                   20428 non-null  category
 7   height                   20428 non-null  float64 
 8   weight                   20428 non-null  float64 
 9   activity_level           20428 non-null  category
 10  goal                     20428 non-null  category
 11  body_type                20428 non-null  category
 12  body_fat                 20428 non-null  float64 
 13  newsletter_subscription  20428 non-null  bool    
 14  notifications_setting    20428 non-null  bool    
 15  affiliate_code_signup    95 non-null     category
 16  language                 20428 non-null  category
 17  country                  2531 non-null   category
 18  points                   20428 non-null  int64   
 19  scientific_data_usage    20428 non-null  category
 20  BMI                      20396 non-null  float64 
 21  BMI_category             20396 non-null  category
dtypes: bool(2), category(13), float64(5), int64(2)
memory usage: 2.2 MB

The tables below show connected and summarized user_programs and users tables. First table is sorted by count of started programs by users and the second one is sorted by number of points achieved.

count points BMI notification_settings scientific_data_usage
user_id
989 41 526242 23.12 True True
1561 38 568855 23.18 True True
8055 37 42382 21.86 True True
7948 36 4869 26.79 True True
1102 33 8500 23.44 True True
count points BMI notification_settings scientific_data_usage
user_id
889 24 2463230 28.40 True True
698 23 1149712 25.65 True True
1416 20 1104077 25.43 True True
2236 21 1044941 26.37 True True
1331 13 1039974 23.30 True True

The user with the most started programs is id 989 (count = 41), followed by id 1561 (count = 38), id 8055 (count = 37), id 7948 (count = 36) and id 1102 (count = 33). The second table shows that the user with the most points (id 889, 2463230 points) started 24 programs; this user's BMI of 28.40 falls in the overweight category.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 9206.00 45.07% 45.07%
male 11222.00 54.93% 100.0%
Total 20428.00 100.0% -
Activity_level
very active 2020.00 9.89% 9.89%
active 10557.00 51.68% 61.57%
sedentary 7851.00 38.43% 100.0%
Total 20428.00 100.0% -
Goal
lose 9560.00 46.8% 46.8%
gain 8337.00 40.81% 87.61%
antiaging 2531.00 12.39% 100.0%
Total 20428.00 100.0% -
Language
en 260.00 1.27% 1.27%
es 20168.00 98.73% 100.0%
Total 20428.00 100.0% -
Body_type
thin 9494.00 46.48% 46.48%
mid 8285.00 40.56% 87.03%
strong 2649.00 12.97% 100.0%
Total 20428.00 100.0% -
BMI_category
Normal 11412.00 55.95% 55.95%
Obesity 2525.00 12.38% 68.33%
Overweight 5799.00 28.43% 96.76%
Underweight 660.00 3.24% 100.0%
Total 20396.00 100.0% -

In the subset of users who agreed to scientific data usage, most programs were started by men - 11222 (55%), while women started 9206 programs (45% of all started programs). Users with activity level active started 10557 programs (52%), sedentary users started 7851 (38%) and very active users started 2020 (10%). Users whose goal is to lose weight started 9560 programs (47%), those who want to gain weight started 8337 (41%) and those with the antiaging goal started 2531 (12%). Users with the app set to English started 260 programs (1%) and users with the app set to Spanish started 20168 (99%). Users with the thin body type started 9494 programs (46%), mid 8285 (41%) and strong 2649 (13%). By BMI category, most programs were started by users with normal weight - 11412 (56%), followed by overweight - 5799 (28%), obesity - 2525 (12%) and underweight - 660 (3%).

Completed programs¶

Below is the subset of data for users who agreed to scientific data usage and completed programs. There are 999 observations, of which 567 have an enjoyment value.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 14 to 81070
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   user_id                  999 non-null    int64   
 1   program_id               999 non-null    category
 2   active                   999 non-null    category
 3   completed                999 non-null    category
 4   enjoyment                567 non-null    float64 
 5   id                       999 non-null    category
 6   gender                   999 non-null    category
 7   height                   999 non-null    float64 
 8   weight                   999 non-null    float64 
 9   activity_level           999 non-null    category
 10  goal                     999 non-null    category
 11  body_type                999 non-null    category
 12  body_fat                 999 non-null    float64 
 13  newsletter_subscription  999 non-null    bool    
 14  notifications_setting    999 non-null    bool    
 15  affiliate_code_signup    7 non-null      category
 16  language                 999 non-null    category
 17  country                  352 non-null    category
 18  points                   999 non-null    int64   
 19  scientific_data_usage    999 non-null    category
 20  BMI                      997 non-null    float64 
 21  BMI_category             997 non-null    category
dtypes: bool(2), category(13), float64(5), int64(2)
memory usage: 756.3 KB

The tables below show connected and summarized user_programs and users tables. First table is sorted by count of completed programs by users and the second one is sorted by number of points achieved.

count points BMI notification_settings scientific_data_usage
user_id
13552 9 7400 29.34 True True
1860 8 152111 28.39 True True
3216 8 573934 29.40 True True
2013 8 590448 24.16 True True
1855 8 592526 22.88 True True
count points BMI notification_settings scientific_data_usage
user_id
889 2 2463230 28.40 True True
698 5 1149712 25.65 True True
1416 6 1104077 25.43 True True
2236 4 1044941 26.37 True True
1331 4 1039974 23.30 True True

The user with the most completed programs is id 13552 (count = 9), followed by id 1860, id 3216, id 2013 and id 1855 (count = 8 each). The second table shows that the user with the most points (id 889, 2463230 points) completed 2 programs; this user's BMI of 28.40 falls in the overweight category. This user started 24 programs and finished only 2.

Frequency Percent Cumulative Percent
Variable factors
Gender
female 339.00 33.93% 33.93%
male 660.00 66.07% 100.0%
Total 999.00 100.0% -
Activity_level
very active 108.00 10.81% 10.81%
active 557.00 55.76% 66.57%
sedentary 334.00 33.43% 100.0%
Total 999.00 100.0% -
Goal
lose 386.00 38.64% 38.64%
gain 470.00 47.05% 85.69%
antiaging 143.00 14.31% 100.0%
Total 999.00 100.0% -
Language
en 23.00 2.3% 2.3%
es 976.00 97.7% 100.0%
Total 999.00 100.0% -
Body_type
thin 458.00 45.85% 45.85%
mid 445.00 44.54% 90.39%
strong 96.00 9.61% 100.0%
Total 999.00 100.0% -
BMI_category
Normal 630.00 63.19% 63.19%
Obesity 61.00 6.12% 69.31%
Overweight 286.00 28.69% 97.99%
Underweight 20.00 2.01% 100.0%
Total 997.00 100.0% -

In this subset, most completed programs were completed by men - 660 (66%), while women completed 339 programs (34% of all completed programs). Users with activity level active completed 557 programs (56%), sedentary users completed 334 (33%) and very active users completed 108 (11%). Users whose goal is to gain weight completed 470 programs (47%), those who want to lose weight completed 386 (39%) and those with the antiaging goal completed 143 (14%). Users with the app set to English completed 23 programs (2%) and users with the app set to Spanish completed 976 (98%). Users with the thin body type completed 458 programs (46%), mid 445 (45%) and strong 96 (10%). By BMI category, most programs were completed by users with normal weight - 630 (63%), followed by overweight - 286 (29%), obesity - 61 (6%) and underweight - 20 (2%).